xLab · UIUC · Prof. Tianyin Xu · Jun 2025 – Present

SREGym

AI Agents · Site Reliability Engineering · Systems Research · Benchmarking · Distributed Systems

Core team @ SREGym, a high-fidelity, interactive benchmark for AI-driven SRE that addresses the absence of production-fidelity evaluation environments for agentic systems. Prior benchmarks (AIOpsLab, ITBench) primarily target application-layer issues with single, clean faults injected into otherwise quiet environments. SREGym models real production complexity: (1) faults across the full stack (hardware, OS kernel, misoperation, application); (2) ambient noise from unrelated low-impact faults; (3) diverse failure modes, including metastable behavior and correlated failures. Actively used by Microsoft Research, Resolve AI, and the University of Washington. Paper accepted at CAIS '26.

90 SRE Problems
50 Fault Injectors
CAIS '26 ACM Paper
4+ Orgs Using It

Fault Coverage Surface

The breadth of failure modes SREGym can simulate, from low-level disk sector errors all the way up to multi-fault retry storms. Each category is a composable primitive that can be stacked into compound problems.

Fault Taxonomy: primitives composable across the full system stack
github.com/SREGym/SREGym

Hardware (Disk · CPU · Memory): dm-dust sector errors · Khaos eBPF syscall fail
OS Kernel (Syscall Faults): Khaos eBPF intercept · errno injection
Misconfiguration (App + Kubernetes): deploy.yaml · ConfigMap · operator spec
Application Bugs (Buggy Code): microservice + operator code injection
Workload Stress (Overload · Fail-slow): stress-ng · client load amplification
Ambient Noise (Distractor Events): log/metric noise · zombie pods · churn
SREGym Distributed System Architecture

Publication

Peer-reviewed research at the 1st ACM Conference on AI and Agentic Systems.

CAIS '26 · 1st ACM Conference on AI and Agentic Systems · May 26, 2026 · San Jose, CA

SREGym: A Live Training Ground for AI SRE Agents with High-Fidelity Failure Drills

Jackson Clark · Yiming Su · Saad Mohammad Rafid Pial · Lily Gniedziejko · Tianyin Xu

SREGym is a benchmark for AI-driven SRE that exposes a live cloud-native cluster where realistic failure scenarios are emulated through composable fault injectors. This is a substantial leap over AIOpsLab and ITBench, which top out at single, isolated faults in clean environments. The framework's extensible orchestrator unlocks three previously out-of-reach problem classes: low-level OS-kernel and hardware faults, multi-fault compound incidents, and noise-laden production-style environments where the root cause hides among unrelated disturbances. Three demonstration drills showcase the new capabilities: simulated bad disk sectors against MongoDB-backed Hotel Reservation, a self-sustaining metastable retry storm orchestrated across gRPC tunables and CPU pressure, and concurrent network + scheduler misconfigurations that force the agent to triage by user-impact severity. SREGym has been adopted by Microsoft Research, Resolve AI, and the University of Washington; Resolve AI's commercial observability controller deploys onto a SREGym cluster with a single kubectl command.

AI Agents · SRE · Benchmarking · Distributed Systems · AIOps · Kubernetes
sregym-console: fleetcast · tidb-misconfiguration-drill
$ kubectl get pods -n fleetcast
tidb-operator-bf4d8f9c7-xr2kp 0/1 ImagePullBackOff 3
fleetcast-satellite-app-backend-pxtwd 1/1 Running 0
fleetcast-satellite-app-backend-t7bpx 1/1 Running 0
[AGENT] tidb-operator not ready; satellite telemetry writes degrading
$ mcp_prometheus.query('kube_pod_container_status_waiting{reason="ImagePullBackOff"}')
[PROM] tidb-operator · ImagePullBackOff · value=1 · namespace=fleetcast
$ mcp_jaeger.get_traces(service="tidb-operator", limit=5)
[JAEGER] 0 traces returned; operator unreachable since 14:32:07 UTC
$ kubectl describe pod tidb-operator-bf4d8f9c7-xr2kp -n fleetcast | grep Image
[AGENT] Image: pingcap/tidb-operatorr ← extra 'r'; registry pull failure
[AGENT] Root cause: misconfigured image name in TiDB operator deployment
$ kubectl set image deploy/tidb-operator manager=pingcap/tidb-operator:v1.5.3 -n fleetcast
[AGENT] tidb-operator 1/1 Running ✓; satellite telemetry writes resumed
$ sregym.submit(diagnosis="Misconfigured image pingcap/tidb-operatorr in TiDB operator")
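
The first triage step in the session above, spotting a container stuck in ImagePullBackOff, can be sketched in Python. This is a minimal illustration and not SREGym or agent code: the helper name find_waiting_containers is hypothetical, and the input is assumed to be the parsed JSON of `kubectl get pods -n fleetcast -o json`, where each pod reports its containers under status.containerStatuses. The sample object is a trimmed stand-in that reuses the names from the session.

```python
from typing import Any


def find_waiting_containers(pod_list: dict[str, Any], reason: str) -> list[tuple[str, str, str]]:
    """Return (pod, container, image) for every container waiting with `reason`."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting and waiting.get("reason") == reason:
                hits.append((pod_name, cs["name"], cs["image"]))
    return hits


# Trimmed, hypothetical pod list mirroring the session above.
sample = {
    "items": [
        {
            "metadata": {"name": "tidb-operator-bf4d8f9c7-xr2kp"},
            "status": {"containerStatuses": [{
                "name": "manager",
                "image": "pingcap/tidb-operatorr",
                "state": {"waiting": {"reason": "ImagePullBackOff"}},
            }]},
        },
        {
            "metadata": {"name": "fleetcast-satellite-app-backend-pxtwd"},
            "status": {"containerStatuses": [{
                "name": "backend",
                "image": "fleetcast/backend",
                "state": {"running": {}},
            }]},
        },
    ]
}

for pod, container, image in find_waiting_containers(sample, "ImagePullBackOff"):
    print(f"{pod}: container {container} cannot pull {image}")
```

Returning the image string alongside the pod name puts the typo directly in front of the agent, which is the same evidence the `kubectl describe` step surfaces in the session.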

Problem Structure

Each SREGym problem defines deployed applications, fault injectors, and evaluation oracles in a single Python class, extensible by design so the community can contribute new failure scenarios.

problem.py: SREGym problem definition · Python · Kubernetes · LangGraph
# Problem, SocialNetwork, the oracle classes, VirtualizationFaultInjector, and
# mark_fault_injected come from the SREGym codebase (imports omitted here).
class K8STargetPortMisconfig(Problem):
    def __init__(self):
        self.app = SocialNetwork()                  # Application
        self.namespace = self.app.namespace
        super().__init__(self.app, self.namespace)
        self.faulty_service = "user-service"        # Service the fault targets

        # === Attach evaluation oracles ===
        self.root_cause = ("The service user-service has a misconfigured target port "
                           "(9999 instead of 9090), causing connection failures.")
        self.diagnosis_oracle  = LLMAsAJudgeDiagnosisOracle(problem=self, expected=self.root_cause)
        self.mitigation_oracle = TargetPortMisconfigMitigationOracle(problem=self)
        self.app.create_workload()                  # Workload

    @mark_fault_injected
    def inject_fault(self):                        # Faults
        injector = VirtualizationFaultInjector(namespace=self.namespace)
        injector.inject(
            fault_type="misconfig_k8s",
            microservice=self.faulty_service,
        )
        print(f"[FAULT INJECTED] {self.faulty_service} misconfigured")
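
What a mitigation check for this problem could verify is sketched below. It is an illustration under stated assumptions, not the actual TargetPortMisconfigMitigationOracle: the function name target_port_is_fixed and the two sample manifests are hypothetical, and only the port values 9999 and 9090 come from the root-cause string above. The input is assumed to be a Kubernetes Service object parsed from JSON.

```python
def target_port_is_fixed(service: dict, expected_port: int = 9090) -> bool:
    """True if the Service has ports and every one of them targets `expected_port`."""
    ports = service.get("spec", {}).get("ports", [])
    return bool(ports) and all(p.get("targetPort") == expected_port for p in ports)


# Hypothetical Service specs: one as left by the fault, one after mitigation.
faulty = {"spec": {"ports": [{"port": 9090, "targetPort": 9999}]}}
fixed = {"spec": {"ports": [{"port": 9090, "targetPort": 9090}]}}

print(target_port_is_fixed(faulty), target_port_is_fixed(fixed))
```

Checking the resulting cluster state rather than the commands the agent ran lets any valid repair path pass, whether the agent patched the Service or re-applied a manifest.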

Representative Problems

Three failure scenarios composed from SREGym's fault injectors, covering failure modes (hardware, metastable, concurrent) that are impossible to reproduce in AIOpsLab or ITBench. Each problem is built by orchestrating fault and noise primitives across the stack.

HW

Bad Sectors in Hard Disk Drives

Uses dm-dust to emulate bad disk sectors at arbitrary locations. Faults propagate as I/O system call failures when the Hotel Reservation application reads from MongoDB: no crash, only subtle storage degradation.

dm-dust · Hardware · DeathStarBench · MongoDB
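
For orientation, the dm-dust workflow behind this drill can be sketched as the dmsetup commands it boils down to. The snippet only builds the command lines and does not run them. The device path /dev/vdb1, the target name dust1, the sector count, and the block numbers are placeholder assumptions, and SREGym's own injector may drive dm-dust differently.

```python
def dust_commands(device: str, sectors: int, bad_blocks: list[int],
                  name: str = "dust1", block_size: int = 512) -> list[list[str]]:
    """Build dmsetup invocations that create a dust target and mark bad blocks."""
    # Table format: <start> <length> dust <device> <offset> <block size>
    table = f"0 {sectors} dust {device} 0 {block_size}"
    cmds = [["dmsetup", "create", name, "--table", table]]
    for block in bad_blocks:
        cmds.append(["dmsetup", "message", name, "0", "addbadblock", str(block)])
    # Reads of the listed blocks only start failing once the target is enabled.
    cmds.append(["dmsetup", "message", name, "0", "enable"])
    return cmds


for cmd in dust_commands("/dev/vdb1", 33552384, [60, 67]):
    print(" ".join(cmd))
```

Because the bad-block list is separate from the enable switch, a drill can stage the sectors first and turn the failure on at a chosen moment.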
META

Metastable Behavior

Sets aggressive gRPC configs (timeout 50 ms, retry 30), saturates Hotel Reservation with 3,000 req/s via Blueprint, then triggers the failure with CPU stress: all RPCs time out and retry at once, causing a self-sustaining retry storm with no crash symptoms.

Blueprint · Metastable · gRPC · DeathStarBench
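
A back-of-the-envelope model shows why this configuration can tip into a retry storm. It assumes that the retry setting of 30 means up to 30 retries per request and that each attempt times out independently with probability p. Both are simplifying assumptions of this sketch, not measurements from the paper.

```python
def expected_attempts(p_timeout: float, max_retries: int) -> float:
    """Expected attempts per request when each attempt times out with p_timeout."""
    # Attempt k (0-based) is only sent if the previous k attempts all timed out.
    return sum(p_timeout ** k for k in range(max_retries + 1))


def offered_load(base_rps: float, p_timeout: float, max_retries: int) -> float:
    """Total RPC attempts per second that reach the backend."""
    return base_rps * expected_attempts(p_timeout, max_retries)


for p in (0.0, 0.5, 1.0):
    print(f"p_timeout={p}: {offered_load(3000, p, 30):.0f} attempts/s")
```

Under these assumptions, once every attempt times out (p = 1), 3,000 req/s turns into 31 × 3,000 = 93,000 attempts/s. That extra load can keep latency above the timeout even after the CPU stress is removed, which is what makes the state self-sustaining.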
COMP

Concurrent Failures

Injects two simultaneous faults into a Social Network: a scheduler misconfiguration making an observability pod unschedulable, and a network misconfiguration failing user requests. The agent must triage by severity: the network fault is user-visible; the scheduler fault is internal.

Compound Faults · DeathStarBench · Jaeger · Scheduler
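
The triage rule in this drill, user-visible impact first, can be written down as a simple ordering. The Incident type, its fields, and the two sample entries are hypothetical illustrations rather than SREGym data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    name: str
    user_visible: bool      # does the fault fail end-user requests?
    affected_services: int  # rough blast radius


def triage(incidents: list[Incident]) -> list[Incident]:
    """Order incidents so user-visible, wider-impact faults come first."""
    return sorted(incidents, key=lambda i: (not i.user_visible, -i.affected_services))


queue = triage([
    Incident("observability pod unschedulable", user_visible=False, affected_services=1),
    Incident("network misconfiguration", user_visible=True, affected_services=1),
])
print([i.name for i in queue])
```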

Fault & Noise Injectors

A library of injection mechanisms spanning the full system stack, from fail-stop crashes to metastable behavior and noisy production environments. Includes Khaos, an SREGym-native eBPF tool simulating OS kernel and hardware faults at the syscall layer. Composable into 4,000+ fault-component scenarios.

Mechanism | Simulated Fault
Kill a process or pod | Fail-stop behavior
stress-ng hardware stress | Fail-slow behavior
eBPF syscall intercept (Khaos) | OS and hardware faults
dm-dust disk sector emulation | Sector errors in disk drives
Faulty deploy.yaml | Service mis-deployment
App config injection | Application misconfiguration
Kubernetes config injection | Kubernetes misconfiguration
Buggy application code | Code bugs
Buggy operator code | Operator bugs
Increase client loads | Service overload
Inject noise into logs/metrics | Noisy observability data
Create zombie resources | Expired/stale resources
Schedule periodic maintenance | Expected cluster churn
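
How stacking primitives multiplies the scenario space can be illustrated with a toy enumeration. The pairing rule below (one fault plus any subset of noise mechanisms) is an assumption made for this sketch. It does not reproduce how SREGym composes scenarios or how its scenario count is derived, and the entries are just a few mechanisms from the list above.

```python
from itertools import combinations

FAULTS = ["dm-dust sector errors", "faulty deploy.yaml", "increased client load"]
NOISE = ["log/metric noise", "zombie resources", "periodic maintenance"]


def compose(faults: list[str], noise: list[str]) -> list[tuple[str, tuple[str, ...]]]:
    """Pair each fault with every subset of noise mechanisms, including none."""
    scenarios = []
    for fault in faults:
        for size in range(len(noise) + 1):
            for subset in combinations(noise, size):
                scenarios.append((fault, subset))
    return scenarios


scenarios = compose(FAULTS, NOISE)
print(len(scenarios))  # 3 faults x 2**3 noise subsets
```

Even this small example yields 24 scenarios from six primitives, and the count grows quickly as more faults and noise sources are added.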

Recognition

Press · Siebel School · UIUC

Featured in Siebel School News

SREGym highlighted for its novel approach to evaluating AI agents on real production failures in live cloud environments.

Industry Adoption

Deployed at Microsoft Research, Resolve AI & University of Washington

Used by top researchers and commercial agentic SRE platforms.

Core Team · xLab · UIUC · Supervised by Prof. Tianyin Xu

Research Contributions

Independent contributions to SREGym's fault coverage, agent observability tooling, evaluation infrastructure, and benchmark automation, spanning Go, Python, and the full Kubernetes stack.

TiDB

Misoperation Fault Mechanism

Enabled misoperation as a new fault class in SREGym by porting TiDB and building a new application layer on top, expanding the benchmark's fault coverage into a previously unexplored category of database misuse failures.

TiDB · Fault Injection · Distributed DB · Benchmark Design
MCP

Prometheus & Jaeger MCP Tools

Built Model Context Protocol tools for the LangGraph agent: a Prometheus MCP server with PromQL querying, and Jaeger trace tools (get_traces, get_services, get_operations), so agents can query live metrics and distributed traces in the middle of an incident-resolution loop.

MCP · Prometheus · Jaeger · LangGraph · PromQL
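
The core of a PromQL tool is small: build a request for Prometheus's instant-query endpoint (/api/v1/query) and flatten the JSON it returns. The sketch below covers only that part with the standard library. The MCP wiring is omitted, the base URL and canned response are made up for illustration, and the actual server may be structured differently or use a client library.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    query = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def summarize(response: dict) -> list[tuple[dict, float]]:
    """Flatten an instant-vector response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return [(s["metric"], float(s["value"][1])) for s in response["data"]["result"]]


# Canned response in the shape Prometheus returns for an instant vector.
canned = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{
            "metric": {"pod": "tidb-operator-bf4d8f9c7-xr2kp", "reason": "ImagePullBackOff"},
            "value": [1718000000, "1"],
        }],
    },
}

print(build_query_url("http://prometheus:9090", "up"))
print(summarize(canned))
```

Returning plain label/value pairs instead of the raw payload keeps the tool output short, which matters when every result is fed back into the agent's context.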
viz

Agent Trace Visualizer

Built a tool that converts JSONL agent outputs into structured, timestamped HTML evaluation reports, making multi-step agent traces human-readable and faster for the research team to review.

Python · JSONL · HTML Generation · Evaluation
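
The JSONL-to-HTML conversion can be sketched in a few lines. The record fields used here (timestamp, role, content) are assumptions for illustration. The actual trace schema and report layout are not described on this page.

```python
import html
import json


def jsonl_to_html(jsonl_text: str) -> str:
    """Render one table row per JSONL record, escaping all agent-produced text."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        event = json.loads(line)
        cells = (event.get("timestamp", ""), event.get("role", ""), event.get("content", ""))
        rows.append("<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in cells) + "</tr>")
    header = "<tr><th>Time</th><th>Role</th><th>Content</th></tr>"
    return "<table>\n" + "\n".join([header, *rows]) + "\n</table>"


trace = '{"timestamp": "14:32:07", "role": "agent", "content": "kubectl get pods -n fleetcast"}\n'
print(jsonl_to_html(trace))
```

Escaping every cell matters because agent output routinely contains angle brackets and quotes from logs and manifests, which would otherwise corrupt the report.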
TUI

Fault Injection TUI

Interactive terminal UI in Go using Bubble Tea for dynamic fault injection and application deployment: a keyboard-driven dashboard for triggering failure drills and managing cluster deployments without raw kubectl commands.

Go · Bubble Tea · TUI · Fault Injection
E2E

Distributed E2E Benchmark Runner

Fully automated end-to-end testing tool that distributes SREGym problems across multiple nodes and handles cluster creation, dependency setup, and parallel execution via tmux. Problems are divided automatically across nodes, so adding nodes directly reduces per-node load; logs are collected automatically back to the local machine.

Python · tmux · Distributed · Automation · Benchmarking
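
The per-node division can be sketched as a round-robin split. This is one plausible scheme under stated assumptions, not necessarily how the runner assigns work: the function name partition and the node and problem names are hypothetical.

```python
def partition(problems: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Round-robin problems across nodes so per-node load differs by at most one."""
    if not nodes:
        raise ValueError("at least one node is required")
    plan: dict[str, list[str]] = {node: [] for node in nodes}
    for i, problem in enumerate(problems):
        plan[nodes[i % len(nodes)]].append(problem)
    return plan


problems = [f"problem-{i:02d}" for i in range(90)]
plan = partition(problems, ["node-1", "node-2", "node-3", "node-4"])
print({node: len(assigned) for node, assigned in plan.items()})
```

With 90 problems, four nodes carry 22 or 23 problems each, compared with 45 each on two nodes.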

Also Built at xLab

Infrastructure and tooling actively used by the xLab research group.

SREGym Target Application · xLab · Actively Used

FleetCast: Satellite Operations Simulator

  • TiDB-backed web application serving satellite orbital data: frontend on port 80, REST API on port 5000, served via nginx Ingress at orbital.local
  • Deployed via Helm (FleetCast/satellite-app chart) with a TiDB Operator + TiDB Cluster for persistence; Prometheus and Jaeger for observability; Locust for synthetic workload generation
  • Registered as a first-class SREGym application and the primary target for the operator_misoperation problem family: 6 benchmark problems that inject K8s operator faults (wrong image, bad update strategy, invalid affinity, overloaded replicas, missing storage, security context) into the TiDB Operator
Kubernetes · TiDB · Helm · Prometheus · Jaeger · Locust
Agent Tooling · xLab

Prometheus & Jaeger MCP Servers

  • Prometheus MCP server: exposes PromQL querying to LangGraph agents, enabling live metric analysis during incident resolution loops
  • Jaeger MCP server: implements get_traces, get_services, and get_operations so agents can correlate distributed trace data with root causes
  • JSONL → HTML trace visualizer: converts raw multi-step agent outputs into structured, timestamped HTML reports for the research team's evaluation workflow
MCP · LangGraph · Prometheus · Jaeger · Python