Why the convergence is happening now

Three forces are collapsing the old boundaries between cloud engineering, SRE, and AI engineering.

First, AI is no longer a side project. It is a standard component of cloud-native delivery, the same way a database or a queue is. You integrate it, you monitor it, you respect its SLOs. A separate AI team operating at arm's length from the service owners is not viable when the AI is in the critical path of the user experience.

Second, the failure modes are shared. An LLM timeout looks, to a user, the same as a database timeout. A hallucinated output causes the same conversion drop as a broken UI. The customer does not know which team is responsible. Nor should they. One error budget, one on-call, one retrospective.

Third, the observability story is unifying. The signals that matter (traces, metrics, logs) now carry AI calls as first-class spans. The tools that already serve SRE teams are learning to speak model, prompt, token, and tool-call. Separate AI-observability stacks are being folded into the primary platform. That removes the last technical excuse for keeping the operating models apart.
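What "AI calls as first-class spans" can look like in practice is sketched below. This is a minimal, illustrative stand-in (a tiny Span record, a stubbed model call), not a real tracer; the attribute names are modeled on OpenTelemetry's GenAI semantic conventions (`gen_ai.*`), so the same trace pipeline that carries your database spans can carry model, token, and latency data.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    """Minimal stand-in for a tracing span (illustrative, not a real tracer)."""
    name: str
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0

def traced_llm_call(prompt: str, model: str, llm) -> tuple[str, Span]:
    """Record an LLM call as a span; attribute names follow the spirit of
    OpenTelemetry's GenAI semantic conventions."""
    span = Span(name="gen_ai.chat")
    span.start = time.monotonic()
    reply, in_tokens, out_tokens = llm(prompt)
    span.end = time.monotonic()
    span.attributes.update({
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": in_tokens,
        "gen_ai.usage.output_tokens": out_tokens,
    })
    return reply, span

def fake_llm(prompt):
    """Stubbed provider so the sketch runs offline (assumption, not a real client)."""
    return ("ok", len(prompt.split()), 1)

reply, span = traced_llm_call("summarize this incident", "example-model", fake_llm)
print(span.attributes["gen_ai.usage.input_tokens"])
```

In a real service the span would be emitted by your existing OpenTelemetry SDK alongside HTTP and database spans, which is exactly the unification the paragraph above describes.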

What an AI SRE actually does

Day-to-day, the work looks like SRE with an extended surface area.

The skills that define the role

The best AI SREs I have met in 2026 share a specific profile. They came up through traditional SRE or platform engineering. They are comfortable with distributed systems, observability primitives, and incident command. Then they added: hands-on familiarity with at least one LLM provider at production scale, a working understanding of retrieval systems and embeddings, and enough evals literacy to sanity-check a regression claim without waiting for a data scientist.

They are not ML researchers. They do not need to be. The discipline is reliability engineering extended to a new class of components. What they do need is judgement about what can safely be automated, what needs human review, and where the coordination layer has to be stronger than the individual components, a theme I unpack in the agentic AI pillar.

What executives should ask for this year

Frequently asked questions

What is AI SRE?

AI SRE (AI Site Reliability Engineering) is the practice of applying SRE principles (SLOs, error budgets, toil reduction, blameless postmortems, unified observability) to AI-enabled services. It treats LLM calls, vector stores, prompt pipelines, and agent orchestration as first-class production components that need the same rigor as any other service tier.
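"Same rigor as any other service tier" has a concrete shape: bounded latency, explicit failure accounting, and a degraded-mode fallback around the model call. The sketch below assumes a hypothetical `call_model` client (here a stub); the pattern, not the client, is the point.

```python
import time

def guarded_llm_call(call_model, prompt: str, timeout_s: float = 2.0,
                     fallback: str = "Sorry, try again shortly."):
    """Treat the LLM like any other dependency: enforce a timeout,
    count failures against the error budget, serve a fallback.
    `call_model` is a placeholder for your provider client."""
    start = time.monotonic()
    try:
        reply = call_model(prompt, timeout=timeout_s)
        ok = True
    except Exception:  # timeout, rate limit, provider error
        reply, ok = fallback, False
    latency = time.monotonic() - start
    # Emit the same signals you would for a database call.
    return {"reply": reply, "ok": ok, "latency_s": latency}

def flaky_model(prompt, timeout):
    """Stub provider for the sketch (assumption, not a real API)."""
    if "fail" in prompt:
        raise TimeoutError
    return "grounded answer"

print(guarded_llm_call(flaky_model, "summarize the outage")["ok"])
print(guarded_llm_call(flaky_model, "fail case")["ok"])
```

The `ok` flag and latency feed the same SLIs as every other tier, which is what makes a single error budget across the service workable.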

How is AI SRE different from MLOps?

MLOps focuses on the lifecycle of training, deploying, and monitoring models. AI SRE focuses on the reliability of the user-facing service that uses those models. The two overlap at deployment and drift detection, but AI SRE extends beyond the model to include the application, the infrastructure, and the customer experience, all under a single error budget.

What skills does an AI SRE need?

Traditional SRE foundations (distributed systems, observability, incident response, capacity planning) combined with AI-era additions: understanding LLM cost and latency curves, working with vector databases and retrieval systems, evaluating agent behavior, and reasoning about non-deterministic outputs. The best AI SREs are strong generalist engineers who know how to instrument the AI layer the same way they instrument a database.
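"Instrument the AI layer the same way you instrument a database" means reusing the same primitives, starting with tail-latency percentiles. A minimal nearest-rank sketch over recorded LLM latencies (the sample values are invented for illustration):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, the same primitive used for DB latency SLIs."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

# Recorded per-call LLM latencies in seconds (illustrative values).
llm_latencies_s = [0.8, 1.1, 0.9, 4.2, 1.0, 0.7, 1.3, 0.95, 6.5, 1.05]
print(percentile(llm_latencies_s, 50))
print(percentile(llm_latencies_s, 95))  # the tail is what users feel
```

In production you would use your metrics backend's histogram quantiles rather than computing this by hand; the discipline of watching p95/p99 rather than the mean carries over unchanged from databases to model calls.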

Do we need a dedicated AI SRE team?

Most organizations do not need a separate team. They need AI competency inside existing SRE and platform teams. A dedicated team is justified at scale (thousands of AI-enabled workflows, strict regulatory requirements, or multi-model platforms). For everyone else, embedding AI SRE practices into the current reliability function is faster and avoids new organizational seams.

What SLOs should apply to an AI-enabled service?

Start with user-outcome SLOs: task success rate, time to resolution, customer-visible error rate. Add cost and latency SLOs per user journey, not per model call. Then add AI-specific SLOs where they matter: groundedness score, hallucination rate, guardrail trigger rate. Error budgets should be set against the user-outcome SLO; the others are diagnostic.
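The arithmetic behind "error budgets set against the user-outcome SLO" is simple and worth making explicit. A 99% task-success SLO leaves a 1% budget of allowed failures per period; a sketch of the bookkeeping, with invented example numbers:

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left for the period.
    slo_target: e.g. 0.99 task-success SLO -> 1% of journeys may fail."""
    budget = (1 - slo_target) * total  # allowed failures this period
    if budget == 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return 1 - failed / budget

# 99% task-success SLO over 10,000 user journeys; 60 have failed so far.
print(error_budget_remaining(0.99, 10_000, 60))  # 0.4 -> 40% of budget left
```

A negative result means the budget is exhausted and, under standard SRE practice, feature work yields to reliability work. The AI-specific SLOs (groundedness, hallucination rate) feed diagnosis, but this user-outcome budget is the one that gates releases.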