The Mindset We Value:
Investigator: Driven by intense curiosity ("Why did it fail?").
Empirical: Don't guess – find a way to get the data.
Observability-Minded: Committed to using data and metrics to understand the system's behavior.
Patient: Has the fortitude to tackle complex, deep-seated issues.
Pragmatic: Delivers results that move the needle.
Required Skills & Experience:
Coding & Scripting: Strong proficiency in Python and Bash, and ideally Go, for automation (including experience with test frameworks like teuthology).
Debugging & Tracing: Expertise with system internals (strace, lsof, /proc), tracing technologies (eBPF), and general debugging (gdb).
System & Resource Analysis: Ability to analyze system resources, including load, iowait, and zombie processes.
Storage & Networking: Deep understanding of, and debugging experience with, storage concepts (S3, HTTP errors, POSIX, ACLs, flock, etc.) and networking tools (tcpdump, Wireshark).
Domain Expertise: Familiarity with tools for breaking storage (fio, fsx, elbencho, the XFS test suite) and an understanding of distributed systems principles (experience with tools like Jepsen or Antithesis is a plus).
Code Reading: Ability to read and understand C/C++ code.