SLOs and Error Budgets: Making Reliability a Business Decision
SLOs are not just about uptime. They are the contract between engineering and product about what level of reliability the business actually needs and can afford.
SRE Newsletter
Every Wednesday, practical stuff for site reliability engineers. SLO design, on-call management, incident response, chaos engineering, and keeping systems reliable at scale.
SRE has gone from specialist discipline to core competency. If you run production infrastructure, the SRE mindset -- measuring reliability, budgeting for failure, automating remediation, building resilience in -- is expected of your team now. This newsletter covers what reliability teams actually deal with: designing SLOs that reflect business priorities, managing on-call without burning people out, testing failure through chaos engineering, and running incident response that prevents the same problems from coming back.
SLOs are not about measuring uptime. They define the reliability contract between engineering and the business. A 99.9% SLO gives you roughly 43 minutes of error budget per month. When that budget runs out, you stop shipping features and focus on stability. It is a disciplined way to trade reliability against velocity.
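The budget arithmetic above is just the complement of the SLO target applied to the month. A minimal sketch (the function name is illustrative, not from any SLO library):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over `days` days."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo)

# A 99.9% target leaves about 43 minutes of budget per 30-day month;
# each extra nine divides the budget by ten.
print(round(error_budget_minutes(0.999), 1))   # ~43.2
print(round(error_budget_minutes(0.9999), 1))  # ~4.3
```

The same calculation works for request-based SLOs by swapping minutes for request counts.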
Teams that make SLOs work tie them to business outcomes, not arbitrary numbers. Your payment system SLO should be higher than your SLO for a non-critical feature. SLOs should reflect what users need, not what your infrastructure can technically achieve. And your error budget should drive your release schedule -- when budget is running low, slow down feature work until things stabilize.
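A budget-driven release gate can be as simple as a threshold check. This is a hypothetical policy sketch -- the function name and the 80% cutoff are assumptions, not a standard; real gates usually also consider burn rate over recent windows:

```python
def can_ship_features(budget_minutes: float, burned_minutes: float,
                      freeze_threshold: float = 0.8) -> bool:
    """Allow feature releases only while error-budget consumption
    is below the freeze threshold (here, 80% of the monthly budget)."""
    if budget_minutes <= 0:
        return False
    return burned_minutes / budget_minutes < freeze_threshold

# With a 43.2-minute budget: 10 minutes burned -> ship; 40 burned -> freeze.
print(can_ship_features(43.2, 10.0))  # True
print(can_ship_features(43.2, 40.0))  # False
```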
Most on-call schedules are badly designed. Engineers rotate through weeks where they are never more than a few minutes from an interrupt. Sleep gets wrecked, personal plans get cancelled, and half the alerts are low-severity noise that could have been handled differently. The result is burnout and turnover in reliability teams.
The rotations that work have clear escalation paths (sev-1 goes to on-call, sev-2 waits until morning), reasonable interrupt budgets (pages are rare enough that false alerts feel notable), and real follow-up (every incident gets a blameless post-mortem and remediation). They also invest in tooling that makes on-call less painful: paging systems with good integrations, runbooks that stay current, and automation that keeps low-severity issues from reaching a human.
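The escalation policy above (sev-1 pages immediately, sev-2 waits) can be expressed as a small routing rule. A sketch under assumed conventions -- the severity levels, business hours, and return values are illustrative, not tied to any particular paging product:

```python
from datetime import datetime, time

def route_alert(severity: int, now: datetime) -> str:
    """Sev-1 pages the on-call engineer immediately; lower severities
    become tickets, handled now during business hours or queued overnight."""
    if severity == 1:
        return "page-oncall"
    if time(9, 0) <= now.time() < time(17, 0):
        return "ticket-now"
    return "queue-for-morning"

print(route_alert(1, datetime(2024, 1, 1, 3, 0)))   # page-oncall
print(route_alert(2, datetime(2024, 1, 1, 10, 0)))  # ticket-now
print(route_alert(2, datetime(2024, 1, 1, 23, 0)))  # queue-for-morning
```

Encoding the policy in code (or paging-system config) rather than tribal knowledge is what keeps 3 a.m. pages rare enough that a false alert feels notable.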
Chaos engineering means intentionally breaking systems in production (or staging) to learn how they fail before real outages teach you the hard way. Tools like Gremlin, Litmus, and custom fault-injection frameworks let you test network latency, service failures, data corruption, and infrastructure unavailability in a controlled way.
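The simplest fault-injection experiment is adding latency to a fraction of calls. A minimal sketch of the idea -- this is a toy decorator for a staging environment, not how Gremlin or Litmus work (those operate at the network and infrastructure layer):

```python
import functools
import random
import time

def inject_latency(probability: float = 0.1, delay_s: float = 0.5):
    """Decorator: delay a random fraction of calls to simulate a slow
    dependency. Intended for controlled experiments, never blind rollout."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_s)  # simulated network/dependency latency
            return fn(*args, **kwargs)
        return inner
    return wrap

@inject_latency(probability=1.0, delay_s=0.05)
def fetch_profile(user_id: int) -> dict:
    return {"id": user_id}  # stand-in for a real downstream call

print(fetch_profile(42))  # {'id': 42}, after a 50ms injected delay
```

Running an experiment like this against a service with aggressive timeouts quickly reveals whether callers degrade gracefully or cascade failures upstream.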
Teams that practice this have fewer surprises during real incidents because they have already seen most failure modes. They know which systems are actually resilient and which have hidden dependencies. They know blast radius and recovery time for each class of failure. The investment -- both tooling and engineering cycles -- pays for itself in reduced incident severity and faster recovery.
Every incident is a chance to learn. Most organizations waste it. They fix the immediate problem, maybe send a message to stakeholders, and move on. A few months later, the same thing happens again.
Reliability teams that have broken this cycle run blameless post-mortems after every significant incident. The point is not blame. It is understanding what happened, finding the root cause, and designing systems and processes that prevent it from recurring. Good post-mortems also capture what went right -- the automation that prevented escalation, the people who coordinated the response well. Reinforcing what works is just as valuable as fixing what failed.