I design and build systems at the intersection of Site Reliability Engineering, distributed systems, and AI agents. My work includes infrastructure automation, agent tooling, and simulators for fault tolerance research.
FleetCast
TiDB-backed orbital simulation platform to test system reliability under misoperations.
- Built FleetCast, the lab’s orbital-pass simulation platform for injecting misoperation faults into distributed databases
- Deployed as its own Helm application with Kubernetes pods and integrated workload generation for controlled experiments
- Connected FleetCast to TiDB clusters managed by the TiDB Operator, creating a reproducible data pipeline for research
- Developed TiDBClusterDeployer, an automation tool that provisions clusters, applies manifests, seeds schemas, and prepares datasets for FleetCast workloads
- Enabled repeatable, realistic fault-injection experiments for operator misoperations, with telemetry streamed into TiDB
- Provided a foundation for benchmarking system resilience and AI-Ops agents under controlled orbital-style workloads
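
A minimal sketch of the provisioning order described above (create the namespace, apply the manifest, seed the schema), written as a dry-run plan that only builds commands. The `DeployPlan` fields, file names, and the `build_commands` helper are hypothetical and are not TiDBClusterDeployer's actual interface. A readiness check between applying the manifest and seeding the schema is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeployPlan:
    namespace: str            # hypothetical namespace for one experiment
    manifest_path: str        # TidbCluster manifest file (assumed to exist)
    schema_path: str          # SQL file with the workload schema (assumed)
    tidb_host: str = "127.0.0.1"
    tidb_port: int = 4000     # TiDB's default MySQL-protocol port


def build_commands(plan: DeployPlan) -> List[List[str]]:
    """Return the ordered commands for one deployment without running them."""
    return [
        ["kubectl", "create", "namespace", plan.namespace],
        ["kubectl", "apply", "-n", plan.namespace, "-f", plan.manifest_path],
        # Seed the schema through TiDB's MySQL-compatible endpoint.
        ["mysql", "-h", plan.tidb_host, "-P", str(plan.tidb_port), "-u", "root",
         "-e", f"source {plan.schema_path}"],
    ]


if __name__ == "__main__":
    plan = DeployPlan("fleetcast", "tidb-cluster.yaml", "schema.sql")
    for argv in build_commands(plan):
        print(" ".join(argv))
```

Returning the plan as data rather than executing it keeps each experiment's setup inspectable and easy to replay.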
Observability Tools for Agents
Built Prometheus and Jaeger integrations as reusable LangGraph nodes, enabling agents to reason over real observability data and drive autonomous SRE workflows.
- Developed a Prometheus MCP Server and tools for querying live metrics with PromQL
- Implemented Jaeger trace nodes (get_traces, get_services, get_operations) for fetching and analyzing distributed traces
- Added summarization pipelines to transform raw metrics, traces, and operations into concise root-cause insights
- Exposed observability data as reusable LangGraph nodes that agents can call during incident loops
- Provided a unified framework that connects metrics, traces, and LLM summarization into actionable, research-ready workflows
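
A minimal sketch of what a metrics-query node could look like, assuming the common LangGraph pattern of a function that takes the graph state and returns a partial state update. It calls the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The state keys `prometheus_url`, `promql`, and `metrics`, and all function names, are hypothetical and are not the project's actual tool names.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List


def build_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten an instant-vector response into {labels, value} records."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample looks like {"metric": {labels}, "value": [unix_ts, "value"]}.
    return [
        {"labels": sample["metric"], "value": float(sample["value"][1])}
        for sample in data["result"]
    ]


def query_metrics_node(state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical node: run state['promql'] and return a partial state update."""
    url = build_query_url(state["prometheus_url"], state["promql"])
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return {"metrics": parse_instant_vector(payload)}
```

Keeping URL construction and response parsing separate from the network call lets both be tested without a running Prometheus server.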
Summarization Tools
Designed summarization modules that distill raw observability data and LLM outputs into concise, actionable insights for agents.
- Built trace summarizers for Jaeger, extracting key spans and condensing distributed traces into root-cause narratives
- Developed metrics summarizers for Prometheus, converting noisy time-series into succinct performance and anomaly reports
- Created LLM conversation summarizers that maintain agent context across long dialogues and incident loops
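
A minimal sketch of the metrics-summarization idea: reduce one noisy time series to a few statistics and flag outliers by z-score. The z-score rule, the threshold of 2.0, and the function names are assumptions chosen for illustration. The source does not say which anomaly method the summarizers use.

```python
from statistics import mean, pstdev
from typing import Any, Dict, List, Tuple

Sample = Tuple[float, float]  # (unix timestamp, value)


def summarize_series(name: str, samples: List[Sample],
                     z_threshold: float = 2.0) -> Dict[str, Any]:
    """Reduce one time series to basic statistics plus z-score outliers."""
    if not samples:
        raise ValueError("cannot summarize an empty series")
    values = [value for _, value in samples]
    mu = mean(values)
    sigma = pstdev(values)
    # A flat series has sigma == 0, so no sample can be an outlier.
    anomalies = [
        (ts, value) for ts, value in samples
        if sigma > 0 and abs(value - mu) / sigma > z_threshold
    ]
    return {
        "name": name,
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mu,
        "anomalies": anomalies,
    }


def render_summary(summary: Dict[str, Any]) -> str:
    """Format the summary as one line suitable for an LLM prompt."""
    return (
        f"{summary['name']}: n={summary['count']}, "
        f"min={summary['min']:.2f}, mean={summary['mean']:.2f}, "
        f"max={summary['max']:.2f}, "
        f"{len(summary['anomalies'])} anomalous sample(s)"
    )
```

A one-line rendering like this keeps prompt size small while still pointing the agent at the samples worth inspecting.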
PostgreSQL Compilation Tool
Implemented a LangGraph node that automates compilation and provisioning of PostgreSQL servers, enabling agents to interact directly with complex database infrastructure.
- Created a compile_postgresql_server tool that builds PostgreSQL from source and initializes a working instance
- Automated the full lifecycle: configure, make, install, initialize (initdb), and launch a running server with a test database
- Integrated with LangGraph state to dynamically select work directories and execution environments
- Provided agents with a safe interface to compile, run, and query PostgreSQL during research workflows
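
The lifecycle above (configure, make, install, initdb, launch, create a test database) can be sketched as an ordered list of commands. This is an illustration under stated assumptions, not the compile_postgresql_server tool itself: the directory layout (`install`, `data`, `server.log` under the work directory), the default port, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple

Step = Tuple[str, List[str]]  # (step name, argv)


def postgres_build_steps(work_dir: str, db_name: str = "testdb",
                         port: int = 5432, jobs: int = 4) -> List[Step]:
    """Return the ordered build-and-launch steps; nothing is executed here."""
    work = Path(work_dir)
    prefix = work / "install"   # assumed install prefix
    data = work / "data"        # assumed data directory
    bin_dir = prefix / "bin"
    return [
        ("configure", ["./configure", f"--prefix={prefix}"]),
        ("make", ["make", f"-j{jobs}"]),
        ("install", ["make", "install"]),
        ("initdb", [str(bin_dir / "initdb"), "-D", str(data)]),
        ("start", [str(bin_dir / "pg_ctl"), "-D", str(data),
                   "-o", f"-p {port}", "-l", str(work / "server.log"), "start"]),
        ("createdb", [str(bin_dir / "createdb"), "-p", str(port), db_name]),
    ]


def run_steps(source_dir: str, steps: List[Step]) -> None:
    """Run each step from the PostgreSQL source tree, stopping on first failure."""
    for name, argv in steps:
        print(f"[{name}] {' '.join(argv)}")
        subprocess.run(argv, cwd=source_dir, check=True)
```

Passing the work directory in as a parameter mirrors the idea of selecting it from graph state, and `check=True` stops the sequence as soon as any step fails.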
Operator Misoperation Faults
Designed and implemented Kubernetes Operator misoperation faults to test the resilience of TiDB clusters under invalid configurations.
- Built automated fault injections that introduce misoperations such as invalid tolerations and affinity rules
- Contributed to more reliable Kubernetes operators through proactive resilience testing
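
A minimal sketch of how two such misoperations could be expressed as merge patches against a TidbCluster resource. The specific values are assumptions for illustration, since the source does not list the injected configurations. The first patch sets a toleration with `operator: Exists` together with a non-empty `value`, which Kubernetes pod validation rejects. The second requires a node label that is assumed not to exist on any node, so pods cannot be scheduled. The field paths `spec.<component>.tolerations` and `spec.<component>.affinity` are also assumed and should be checked against the installed CRD version.

```python
import json
from typing import Any, Dict, List


def invalid_toleration_patch(component: str = "tikv") -> Dict[str, Any]:
    """Toleration that pairs operator 'Exists' with a value (invalid for pods)."""
    return {"spec": {component: {"tolerations": [
        {"key": "dedicated", "operator": "Exists",
         "value": "tikv", "effect": "NoSchedule"},
    ]}}}


def unsatisfiable_affinity_patch(component: str = "tikv") -> Dict[str, Any]:
    """Node affinity requiring a label that no node is assumed to carry."""
    return {"spec": {component: {"affinity": {"nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{"matchExpressions": [
                {"key": "example.com/no-such-label",  # hypothetical label
                 "operator": "In", "values": ["absent"]},
            ]}],
        },
    }}}}}


def kubectl_patch_command(cluster: str, namespace: str,
                          patch: Dict[str, Any]) -> List[str]:
    """Build (but do not run) the kubectl command that applies the patch."""
    return ["kubectl", "patch", "tidbcluster", cluster, "-n", namespace,
            "--type", "merge", "-p", json.dumps(patch)]
```

Expressing each fault as a small patch makes it easy to apply, observe how the operator reacts, and then revert by re-applying the original manifest.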