Research

I design and build systems at the intersection of Site Reliability Engineering, distributed systems, and AI agents. My work spans infrastructure automation, agent tooling, and simulators for fault-tolerance research.

FleetCast

A TiDB-backed orbital-pass simulation platform for testing system reliability under operator misoperations.

Built FleetCast, the lab’s orbital-pass simulation platform for injecting misoperation faults into distributed databases
Deployed FleetCast as a standalone Helm application running in Kubernetes pods, with integrated workload generation for controlled experiments
Connected FleetCast to TiDB clusters managed by the TiDB Operator, creating a reproducible data pipeline for research
Developed TiDBClusterDeployer, an automation tool that provisions clusters, applies manifests, seeds schemas, and prepares datasets for FleetCast workloads
Enabled repeatable, realistic fault-injection experiments (operator misoperations) with telemetry streaming into TiDB
Provided a foundation for benchmarking system resilience and AI-Ops agents under controlled orbital-style workloads
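
The deployer's step ordering (apply manifests, then seed schemas) can be illustrated with a small fail-fast pipeline. This is a minimal sketch, not the actual TiDBClusterDeployer code: the `DeployPlan` class, its fields, the file names, and the injectable `run` callable are all hypothetical, and waiting for the cluster to become ready is omitted. The only facts it relies on are that `kubectl apply` accepts `-n` and `-f`, and that TiDB speaks the MySQL protocol on port 4000 by default.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A runner takes an argv list and returns an exit code (hypothetical interface).
Runner = Callable[[Sequence[str]], int]


@dataclass
class DeployPlan:
    """Hypothetical description of one deployment; not the real tool's schema."""
    namespace: str
    manifest_path: str
    schema_path: str
    sql_host: str
    sql_port: int = 4000  # TiDB's default MySQL-protocol port

    def steps(self) -> List[List[str]]:
        """Return the commands in the order they must run."""
        return [
            ["kubectl", "create", "namespace", self.namespace],
            ["kubectl", "apply", "-n", self.namespace, "-f", self.manifest_path],
            ["mysql", "-h", self.sql_host, "-P", str(self.sql_port), "-u", "root",
             "-e", f"source {self.schema_path}"],
        ]


def deploy(plan: DeployPlan, run: Runner) -> List[List[str]]:
    """Run each step in order and stop at the first non-zero exit code."""
    completed: List[List[str]] = []
    for argv in plan.steps():
        if run(argv) != 0:
            raise RuntimeError(f"step failed: {' '.join(argv)}")
        completed.append(argv)
    return completed
```

Passing the runner in as a parameter keeps the plan testable without a cluster: a dry run can record the commands, while a real run could pass `lambda argv: subprocess.run(argv).returncode`.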

Observability Tools for Agents

Built Prometheus and Jaeger integrations as reusable LangGraph nodes, enabling agents to reason over real observability data and drive autonomous SRE workflows.

Developed a Prometheus MCP server and companion tools for querying live metrics with PromQL
Implemented Jaeger trace nodes (get_traces, get_services, get_operations) for fetching and analyzing distributed traces
Added summarization pipelines to transform raw metrics, traces, and operations into concise root-cause insights
Exposed observability data as reusable LangGraph nodes that agents can call during incident loops
Provided a unified framework that connects metrics, traces, and LLM summarization into workflows agents can act on and researchers can reproduce
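
To illustrate how a metrics query can be wrapped as a graph node, here is a minimal sketch. It assumes the usual LangGraph convention that a node is a function taking the current state and returning a partial state update; the state keys `prometheus_url`, `promql`, and `metrics` are hypothetical names chosen for this example, not the ones used in the actual tooling. The request targets Prometheus's instant-query endpoint, `/api/v1/query`.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten an instant-vector response into label/value records."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample's value is a [timestamp, "string value"] pair.
    return [
        {"labels": sample["metric"], "value": float(sample["value"][1])}
        for sample in data["result"]
    ]


def prometheus_node(state: Dict[str, Any]) -> Dict[str, Any]:
    """Node-style function: read the query from state, return a state update.

    The state keys used here are assumptions made for this sketch.
    """
    url = build_query_url(state["prometheus_url"], state["promql"])
    with urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return {"metrics": parse_instant_vector(payload)}
```

Keeping URL construction and response parsing separate from the network call means both can be unit-tested without a running Prometheus server.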

Summarization Tools

Designed summarization modules that distill raw observability data and LLM outputs into concise, actionable insights for agents.

Built trace summarizers for Jaeger, extracting key spans and condensing distributed traces into root-cause narratives
Developed metrics summarizers for Prometheus, converting noisy time-series into succinct performance and anomaly reports
Created LLM conversation summarizers that maintain agent context across long dialogues and incident loops
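
As an example of turning a noisy time series into a one-line report, the sketch below computes basic statistics and flags outliers with a z-score test. The z-score rule, the threshold of 3, and the function name `summarize_series` are assumptions made for illustration; the actual summarizers may use a different anomaly criterion.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def summarize_series(
    name: str,
    samples: List[Tuple[float, float]],  # (timestamp, value) pairs
    z_threshold: float = 3.0,
) -> str:
    """Condense a time series into one line: count, min/mean/max, anomalies."""
    values = [v for _, v in samples]
    if not values:
        return f"{name}: no samples"

    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; 0.0 for a flat series

    # A sample is anomalous when it lies more than z_threshold deviations from the mean.
    anomalies = [
        (ts, v) for ts, v in samples
        if sigma > 0 and abs(v - mu) / sigma > z_threshold
    ]

    line = f"{name}: n={len(values)} min={min(values):.2f} mean={mu:.2f} max={max(values):.2f}"
    if anomalies:
        ts, v = max(anomalies, key=lambda p: abs(p[1] - mu))
        line += f"; {len(anomalies)} anomalous sample(s), largest {v:.2f} at t={ts:.0f}"
    else:
        line += "; no anomalies"
    return line
```

A compact string like this fits in an agent's prompt far more cheaply than the raw samples, at the cost of hiding the shape of the series.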

PostgreSQL Compilation Tool

Implemented a LangGraph node that automates compilation and provisioning of PostgreSQL servers, enabling agents to interact directly with complex database infrastructure.

Created a compile_postgresql_server tool that builds PostgreSQL from source and initializes a working instance
Automated the full lifecycle: configure, make, install, initialize (initdb), and launch a running server with a test database
Integrated with LangGraph state to dynamically select work directories and execution environments
Provided agents with a safe interface to compile, run, and query PostgreSQL during research workflows
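
The lifecycle above (configure, make, install, initdb, launch, create a test database) can be sketched as an ordered command plan. This is an illustrative sketch rather than the compile_postgresql_server implementation: the function names, default values, and log file location are hypothetical. The commands themselves are the standard PostgreSQL source-build and server tools.

```python
import subprocess
from typing import List


def postgres_build_plan(
    source_dir: str,
    prefix: str,
    data_dir: str,
    jobs: int = 4,
    port: int = 5432,
    test_db: str = "testdb",
) -> List[List[str]]:
    """Return the build-and-launch commands in execution order."""
    bin_dir = f"{prefix}/bin"
    return [
        [f"{source_dir}/configure", f"--prefix={prefix}"],  # configure
        ["make", f"-j{jobs}"],                              # compile
        ["make", "install"],                                # install into prefix
        [f"{bin_dir}/initdb", "-D", data_dir],              # create the data directory
        [f"{bin_dir}/pg_ctl", "-D", data_dir,               # start and wait for readiness
         "-l", f"{data_dir}/server.log", "-o", f"-p {port}", "-w", "start"],
        [f"{bin_dir}/createdb", "-p", str(port), test_db],  # create the test database
    ]


def run_plan(plan: List[List[str]], cwd: str) -> None:
    """Execute each command in the source directory, stopping on the first failure."""
    for argv in plan:
        subprocess.run(argv, cwd=cwd, check=True)
```

Separating the plan from its execution lets an agent inspect or log the exact commands before anything runs, for example `run_plan(postgres_build_plan(src, prefix, data), cwd=src)`.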

Operator Misoperation Faults

Designed and implemented Kubernetes Operator misoperation faults to test the resilience of TiDB clusters under invalid configurations.

Built automated fault injections that introduce misoperations such as invalid tolerations and affinity rules
Contributed to more reliable Kubernetes operators through proactive resilience testing
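
The two misoperations named above can be illustrated as pure functions that take a cluster manifest (as a dictionary) and return a faulty copy. This is a sketch under assumptions: it presumes the component spec exposes `affinity` and `tolerations` fields in the Kubernetes pod-spec shape, and the function names and the label key `fault-injection/nonexistent` are hypothetical. The toleration is invalid because Kubernetes requires an empty `value` when `operator` is `Exists`; the affinity rule is well-formed but selects a label that no node is expected to carry.

```python
import copy
from typing import Any, Dict


def inject_unsatisfiable_affinity(cluster: Dict[str, Any], component: str = "tikv") -> Dict[str, Any]:
    """Return a copy whose component requires a node label that no node has."""
    faulty = copy.deepcopy(cluster)  # never mutate the caller's manifest
    spec = faulty.setdefault("spec", {}).setdefault(component, {})
    spec["affinity"] = {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        "key": "fault-injection/nonexistent",  # hypothetical label
                        "operator": "In",
                        "values": ["true"],
                    }]
                }]
            }
        }
    }
    return faulty


def inject_invalid_toleration(cluster: Dict[str, Any], component: str = "tikv") -> Dict[str, Any]:
    """Return a copy with a toleration that sets a value alongside operator Exists."""
    faulty = copy.deepcopy(cluster)
    spec = faulty.setdefault("spec", {}).setdefault(component, {})
    spec.setdefault("tolerations", []).append({
        "key": "dedicated",
        "operator": "Exists",
        "value": "tikv",  # invalid: value must be empty when operator is Exists
        "effect": "NoSchedule",
    })
    return faulty
```

Returning a modified copy keeps the original manifest available, so an experiment can restore the healthy configuration after observing how the operator reacts.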
