SREGym: A Live Training Ground for AI SRE Agents with High-Fidelity Failure Drills
SREGym is a benchmark for AI-driven SRE that exposes a live cloud-native cluster where realistic failure scenarios are emulated through composable fault injectors, a substantial leap over AIOpsLab and ITBench, which top out at single, isolated faults in clean environments. The framework's extensible orchestrator unlocks three previously out-of-reach problem classes: low-level OS-kernel and hardware faults, multi-fault compound incidents, and noise-laden production-style environments where the root cause hides among unrelated disturbances. Three demonstration drills showcase the new capabilities: simulated bad disk sectors against MongoDB-backed Hotel Reservation, a self-sustaining metastable retry storm orchestrated across gRPC tunables and CPU pressure, and concurrent network + scheduler misconfigurations forcing the agent to triage by user-impact severity. Adopted by Microsoft Research, Resolve AI, and the University of Washington, Resolve AI's commercial observability controller deploys onto a SREGym cluster with a single kubectl command.