In XLab’s Site Reliability Engineering research lab, I design and evaluate systems that make cloud infrastructure more resilient. I began by building observability tools that let agents interpret real metrics and traces, and I am now developing a benchmarking framework and fault injection experiments to study how autonomous agents can detect and mitigate failures in distributed databases and Kubernetes operators.
- FleetCast — TiDB-backed orbital simulator with Helm/Kubernetes deployments
- Observability Nodes — Prometheus + Jaeger integrations for LangGraph agents
- Summarizers — pipelines that condense traces, metrics, and LLM outputs
- Fault Work — injecting faults in K8s operators and building tools to mitigate them
Kubernetes, Distributed Systems, LangGraph
Site Reliability Engineering, AI Agents, Docker
Agent Benchmark Harness
Tool-calling tasks with post-tool hooks; Prometheus/Jaeger metrics and traces.
Mission Control UI
Kubernetes- and Docker-based orbital simulation app with fault injection and holographic controls.