Lily Gniedziejko, Projects

AI/SRE Engineering Intern, Ciroos

Summer 2026 · Incoming

ciroos.ai ↗

Joining Ciroos to work on their AI SRE Teammate, a multi-agent platform that investigates production incidents, correlates telemetry across observability, infra, and code domains, and drives autonomous remediation while keeping humans in the loop.

AI Agents · SRE · Multi-Agent Systems · Observability · Incident Response · LLM Tooling · Production AI

Automate

Reduces SRE toil through deep enterprise integrations across observability, ticketing, and runbook systems; agents handle routine investigation work end-to-end.

Augment

Rapid cross-domain data analysis: agents correlate metrics, logs, traces, and deployment state to surface root-cause hypotheses faster than a manual sweep through dashboards.
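
As a rough illustration of this kind of correlation, the sketch below ranks recent deployments on the affected service as root-cause candidates for an anomaly. It is not Ciroos's implementation: the `Event` shape, the `candidate_causes` function, and the 15-minute window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """A hypothetical, simplified telemetry event."""
    source: str          # e.g. "deploy", "metric", "log"
    service: str
    timestamp: datetime
    detail: str


def candidate_causes(
    anomaly: Event,
    events: list[Event],
    window: timedelta = timedelta(minutes=15),  # assumed look-back window
) -> list[Event]:
    """Return deploy events on the anomaly's service that landed within
    `window` before the anomaly, most recent first."""
    start = anomaly.timestamp - window
    hits = [
        e for e in events
        if e.source == "deploy"
        and e.service == anomaly.service
        and start <= e.timestamp <= anomaly.timestamp
    ]
    return sorted(hits, key=lambda e: e.timestamp, reverse=True)
```

A real agent would weigh many more signals (traces, log patterns, infrastructure changes); the time-window join is only the simplest useful starting point.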

Autopilot

Autonomous anomaly investigation: agents proactively triage emerging incidents, gather evidence, and propose mitigation without waiting for a page to fire.

Why It Matters

Ciroos is translating frontier agentic-systems research into something on-call engineers actually trust at 3 AM, which is exactly the production-deployment lens my SREGym work measures against.

Software Engineering Intern, Mueller Water Products

Built internal software tools for manufacturing operations, including a maintenance RAG chatbot, a Microsoft Teams SQL bot, machine data-entry workflows, and Power BI downtime dashboards for factory users.

React · RAG · Azure · Python · Chatbot Development · Power BI

Maintenance Chatbot

Developed an internal RAG chatbot over 1,500+ technical PDFs, including manuals of 400+ pages and engineering drawings, that helps maintenance teams find answers quickly and links each answer directly to the exact page in the source document.
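
One way to get exact-page links is to index each PDF page as its own chunk, so every retrieved passage already carries its page number. The sketch below shows that idea only; it is not the production system. Token overlap stands in for embedding similarity, and `PageChunk`, `retrieve`, and the example URLs are hypothetical. The `#page=N` URL fragment is the common PDF open parameter that most browser viewers honor.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    """One page of one PDF; indexing per page keeps citations exact."""
    doc_url: str   # hypothetical internal document URL
    page: int      # 1-based page number
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[PageChunk], k: int = 3) -> list[tuple[str, int]]:
    """Rank pages by token overlap with the query (a stand-in for embedding
    similarity) and return up to k (page link, score) pairs."""
    q = tokenize(query)
    scored = [(len(q & tokenize(c.text)), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s > 0]      # drop pages with no overlap
    scored.sort(key=lambda sc: sc[0], reverse=True)    # stable sort, best first
    return [(f"{c.doc_url}#page={c.page}", s) for s, c in scored[:k]]
```

The retrieved page texts would then be passed to the language model as context, with the links shown next to the generated answer.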

Microsoft Teams Bot

Created a Microsoft Teams bot that executes SQL queries and surfaces results through Power BI dashboards for faster operational reporting.

Machine Data Entry App

Implemented a full-stack data-entry application for Autopour and melting-machine workflows, helping standardize factory data collection.

Downtime Dashboard

Built a Power BI dashboard to visualize machine downtime with filters for machine, date range, and operational trends.

Research Intern, SREGym, XLab UIUC

06/2025–Present · with Prof. Tianyin Xu


SREGym is a live benchmark for evaluating AI Site Reliability Engineering agents on realistic production failures. Instead of testing agents on clean, single-fault scenarios, SREGym deploys real cloud-native systems on Kubernetes and creates high-fidelity failure drills using fault injection, observability data, and automated evaluation oracles.

The benchmark supports production-style environments across systems such as Kubernetes, TiDB, MongoDB, Kafka, and microservice applications. Agents diagnose and mitigate failures using tools like kubectl, Prometheus, Loki, and Jaeger, allowing researchers to evaluate whether AI agents can actually reason through incidents rather than just suppress alerts.
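
To make the distinction between mitigating a failure and merely suppressing its alerts concrete, here is a toy failure drill with a state-based oracle. This is a minimal sketch under stated assumptions, not SREGym's API: `FakeCluster`, `inject_fault`, `mitigation_oracle`, and both agents are hypothetical, and the real benchmark runs against live Kubernetes systems rather than an in-memory object.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeCluster:
    """Toy in-memory stand-in for a running system (hypothetical)."""
    replicas_ready: int = 3
    replicas_desired: int = 3
    alerts_silenced: bool = False


def inject_fault(cluster: FakeCluster) -> None:
    """Simulate a fault that takes every replica down."""
    cluster.replicas_ready = 0


def mitigation_oracle(cluster: FakeCluster) -> bool:
    """Pass only if the system is actually healthy; alert state is
    deliberately ignored, so silencing alerts cannot pass the drill."""
    return cluster.replicas_ready == cluster.replicas_desired


def suppressing_agent(cluster: FakeCluster) -> None:
    """Hides the symptom without fixing the system."""
    cluster.alerts_silenced = True


def repairing_agent(cluster: FakeCluster) -> None:
    """Restores the desired replica count."""
    cluster.replicas_ready = cluster.replicas_desired


def run_drill(agent: Callable[[FakeCluster], None]) -> bool:
    """Inject the fault, let the agent act, then ask the oracle for a verdict."""
    cluster = FakeCluster()
    inject_fault(cluster)
    agent(cluster)
    return mitigation_oracle(cluster)
```

Calling `run_drill(suppressing_agent)` fails the drill while `run_drill(repairing_agent)` passes, because the oracle inspects system state rather than alert state.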

Impact: Paper accepted at ACM CAIS ’26 as "SREGym: A Live Training Ground for AI SRE Agents with High-Fidelity Failure Drills"
Adoption: Used by researchers and teams at Microsoft Research, Resolve AI, the University of Washington, and SRE startups
Scale: Includes 87 failure drills across realistic cloud and distributed-system environments
Fidelity: Models OS, hardware, Kubernetes, application, correlated, compound, noisy, and metastable failures

SREGym · AI SRE · Kubernetes · Fault Injection · Distributed Systems · Observability · Prometheus · Jaeger · TiDB

What SREGym Is

A live training and evaluation ground where AI SRE agents must diagnose and mitigate real failures in running cloud systems.

Why It Matters

SREGym fills the gap between toy incident benchmarks and real production environments by testing agents on noisy, complex, high-fidelity failures.

Research Impact

Accepted at ACM CAIS ’26 and actively used by academic and industry researchers working on AI agents for incident response.

My Role

I contribute to the benchmark infrastructure, including Kubernetes-based target systems, TiDB fault scenarios, observability tooling, and agent evaluation workflows.

Projects

FleetCast – Satellite Operations Simulator

RSO Swiper & Research Lab Finder – React Native App

YouTube AI Assistant – Chrome Extension