Internship Projects

AI/SRE Engineering Intern, Ciroos

Summer 2026 · Incoming

Joining Ciroos to work on their AI SRE Teammate, a multi-agent platform that investigates production incidents, correlates telemetry across observability, infra, and code domains, and drives autonomous remediation while keeping humans in the loop.

AI Agents • SRE • Multi-Agent Systems • Observability • Incident Response • LLM Tooling • Production AI

Automate

Reduces SRE toil through deep enterprise integrations across observability, ticketing, and runbook systems; agents handle routine investigation work end-to-end.

Augment

Rapid cross-domain data analysis: agents correlate metrics, logs, traces, and deployment state to surface root-cause hypotheses faster than an engineer sweeping dashboards one at a time.

Autopilot

Autonomous anomaly investigation: agents proactively triage emerging incidents, gather evidence, and propose mitigations without waiting for a page to fire.

Why It Matters

Ciroos is translating frontier agentic-systems research into something on-call engineers actually trust at 3 AM, which is exactly the production-deployment lens my SREGym work measures against.

Software Engineering Intern, Mueller Water Products

Built internal software tools for manufacturing operations, including a maintenance RAG chatbot, a Microsoft Teams SQL bot, machine data-entry workflows, and Power BI downtime dashboards for factory users.

React • RAG • Azure • Python • Chatbot Development • Power BI

Maintenance Chatbot

Developed an internal RAG chatbot over 1,500+ technical PDFs, including 400+ page manuals and engineering drawings, to help maintenance teams find answers quickly and link directly to exact pages in source documents.

Microsoft Teams Bot

Created a Microsoft Teams bot that executes SQL queries and surfaces results through Power BI dashboards for faster operational reporting.

Machine Data Entry App

Implemented a full-stack data-entry application for Autopour and melting-machine workflows, helping standardize factory data collection.

Downtime Dashboard

Built a Power BI dashboard to visualize machine downtime with filters for machine, date range, and operational trends.

Research Intern, SREGym, XLab UIUC

06/2025–Present · with Prof. Tianyin Xu

SREGym is a live benchmark for evaluating AI Site Reliability Engineering agents on realistic production failures. Instead of testing agents on clean, single-fault scenarios, SREGym deploys real cloud-native systems on Kubernetes and creates high-fidelity failure drills using fault injection, observability data, and automated evaluation oracles.

The benchmark supports production-style environments across systems such as Kubernetes, TiDB, MongoDB, Kafka, and microservice applications. Agents diagnose and mitigate failures using tools like kubectl, Prometheus, Loki, and Jaeger, allowing researchers to evaluate whether AI agents can actually reason through incidents rather than just suppress alerts.

  • Impact: Paper accepted at ACM CAIS ’26 as "SREGym: A Live Training Ground for AI SRE Agents with High-Fidelity Failure Drills"
  • Adoption: Used by researchers and teams at Microsoft Research, Resolve AI, the University of Washington, and SRE startups
  • Scale: Includes 87 failure drills across realistic cloud and distributed-system environments
  • Fidelity: Models OS, hardware, Kubernetes, application, correlated, compound, noisy, and metastable failures
SREGym • AI SRE • Kubernetes • Fault Injection • Distributed Systems • Observability • Prometheus • Jaeger • TiDB

What SREGym Is

A live training and evaluation ground where AI SRE agents must diagnose and mitigate real failures in running cloud systems.

Why It Matters

SREGym fills the gap between toy incident benchmarks and real production environments by testing agents on noisy, complex, high-fidelity failures.

Research Impact

Accepted at ACM CAIS ’26 and actively used by academic and industry researchers working on AI agents for incident response.

My Role

I contribute to the benchmark infrastructure, including Kubernetes-based target systems, TiDB fault scenarios, observability tooling, and agent evaluation workflows.

Projects

systems • agents • tooling

FleetCast – Satellite Operations Simulator

TiDB-backed web application serving satellite orbital data, deployed in xLab as a live SREGym target environment. The frontend and REST API are served via nginx Ingress; TiDB Operator manages persistence; Prometheus and Jaeger provide observability; Locust generates synthetic workloads. It is the primary target for SREGym's operator_misoperation problem family: 6 benchmark problems that inject Kubernetes operator faults for AI agents to diagnose and mitigate.

Kubernetes • TiDB • Helm • Prometheus • Jaeger

RSO Swiper & Research Lab Finder – React Native App

Developed a swipe-to-save app with Firebase authentication and OpenAI embeddings to filter RSO and lab cards. Scraped UIUC CS and RSO sites with BeautifulSoup and Cheerio to build the dataset.

React Native • Firebase • OpenAI • Web Scraping

YouTube AI Assistant – Chrome Extension

Published Chrome extension that overlays YouTube with AI-generated summaries and quizzes using scraped transcripts. Built with LangChain pipelines and the OpenAI API to help students study more efficiently.

Chrome Extension • LangChain • OpenAI API