
Beyond DORA: Engineering Metrics for 2026

DORA metrics gave engineering leaders a shared language for software delivery performance. Deployment frequency, lead time, change failure rate, mean time to restore. Four numbers that made engineering legible to the business for the first time. But DORA was designed for a world where engineering teams shipped deterministic software through CI/CD pipelines. In 2026, half your teams are shipping probabilistic AI systems, your best engineers are spending 30% of their time fighting tooling friction, and the board wants to know about business impact, not deployment counts. DORA is not wrong. It is incomplete.

By Thomas Prommer · Published May 25, 2026

What DORA Gets Right

Before critiquing DORA, it is worth acknowledging what it achieved. Before the State of DevOps Reports and the 2018 book Accelerate, which distilled that research, engineering organizations had no standardized vocabulary for delivery performance. Every company measured different things. Benchmarking was impossible. When a CTO said "we ship fast," there was no way to verify or compare.

DORA fixed that. The four metrics are well-defined, measurable with existing tooling, and validated against organizational performance through years of survey research. The classification system (Elite, High, Medium, Low) gives teams a benchmark that is directionally useful even if the exact cutoffs are debatable.

Deployment frequency measures throughput: how often does the team get changes into production? Elite teams deploy on demand, multiple times per day. Low performers deploy between once per month and once every six months. The gap is typically explained by CI/CD maturity, test automation, and organizational trust.

Lead time for changes measures velocity: how long from code committed to code running in production? Elite teams measure this in under one day. The metric captures pipeline friction, review bottlenecks, and deployment ceremony. It is genuinely useful for identifying process problems.

Change failure rate measures quality: what percentage of changes to production result in degraded service requiring remediation? Elite teams keep this below 5%. This is the balancing metric that prevents teams from optimizing deployment frequency at the expense of stability.

Mean time to restore (MTTR) measures resilience: when something breaks, how quickly does the team fix it? Elite teams restore within one hour. This metric rewards investment in monitoring, on-call processes, and incident management.

Together, these four metrics form a balanced scorecard for software delivery. Speed without quality is reckless. Quality without speed is stagnation. DORA captures both tensions in four numbers that fit on one slide. That is genuinely valuable, and abandoning it entirely would be a mistake.
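To make the four definitions concrete, here is a minimal Python sketch that derives them from a list of deployment records. The `Deployment` record shape, the field names, and the choice of median for lead time are my assumptions for illustration, not a DORA-prescribed schema; real data would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; adapt the fields to what your tooling exports.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # degraded service requiring remediation
    restored_at: Optional[datetime] = None  # set only for failed deployments


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute the four DORA metrics over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.failed]
    lead_times_h = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    restore_times_h = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times_h),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": (
            sum(restore_times_h) / len(restore_times_h) if restore_times_h else 0.0
        ),
    }


start = datetime(2026, 5, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start, start + timedelta(hours=8)),
    Deployment(start, start + timedelta(hours=6), failed=True,
               restored_at=start + timedelta(hours=6, minutes=30)),
    Deployment(start, start + timedelta(hours=10)),
]
metrics = dora_metrics(sample, window_days=2)
print(metrics)
```

Computing change failure rate in the same pass as deployment frequency keeps the speed and stability numbers side by side, which is the balance the framework is built around.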

Where DORA Falls Short

DORA's limitations are not bugs in the framework. They are scope boundaries. DORA measures software delivery performance. It does not measure engineering effectiveness, developer experience, business impact, or strategic alignment. The problem is not DORA. The problem is treating DORA as a complete measurement system when it was always designed to be one input among several.

1. Developer Experience Is Invisible

A team can have Elite DORA metrics and miserable developers. Fast deployment pipelines do not mean developers enjoy their work. Lead time can look great while engineers spend 40% of their time fighting flaky tests, waiting for environments to spin up, or wrestling with a codebase that nobody documented. DORA measures the pipeline. It does not measure the human beings feeding the pipeline. Developer attrition, satisfaction, and cognitive load sit entirely outside the DORA frame.

This matters financially. Replacing a senior engineer costs 6-9 months of salary in recruiting, onboarding, and ramp-up productivity loss. A team with Elite DORA metrics and 30% annual attrition is not high-performing. It is burning engineers to maintain the metrics, which is unsustainable.
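The arithmetic is worth running once for your own team. This sketch applies the 6-9 months-of-salary replacement cost to an attrition rate; the headcount and salary in the example are hypothetical inputs, not benchmarks.

```python
def annual_attrition_cost(
    headcount: int,
    attrition_rate: float,
    annual_salary: float,
    months_low: int = 6,
    months_high: int = 9,
) -> tuple[float, float]:
    """Return the (low, high) yearly replacement cost for a team."""
    departures = round(headcount * attrition_rate)  # whole engineers leaving per year
    monthly_salary = annual_salary / 12
    return (
        departures * monthly_salary * months_low,
        departures * monthly_salary * months_high,
    )


# Hypothetical team: 20 engineers, 30% attrition, 180,000 average salary.
low, high = annual_attrition_cost(20, 0.30, 180_000)
print(f"Replacement cost per year: {low:,.0f} to {high:,.0f}")
```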

2. Business Impact Is Assumed, Not Measured

DORA's research shows that high delivery performance correlates with better organizational outcomes. But correlation at the population level does not mean your specific team's deployment frequency is causing better business results. A team deploying 50 times per day on a feature nobody uses is not creating value. A team deploying twice a week on the revenue-critical checkout flow might be creating enormous value.

The missing link is impact attribution. Which deployments moved business metrics? Which features justified their engineering cost? DORA measures throughput and stability but not whether the throughput produced anything the business needed. For boards and executives, this gap makes DORA data interesting but not actionable.

3. AI and ML Workflows Break the Model

DORA's four metrics assume a specific workflow: code is written, reviewed, merged, deployed, and either works or fails. AI engineering does not follow this pattern. A model fine-tuning run takes weeks and produces no "deployments" until evaluation completes. A prompt engineering cycle might produce 30 "deployments" (prompt versions) in a day, each of which is trivial in isolation but part of a larger optimization effort.

Change failure rate breaks entirely on probabilistic systems. A new model version is never binary pass/fail. It improves accuracy on English inputs by 4% while degrading multilingual performance by 1.5%. Is that a failure? DORA cannot answer because DORA assumes deterministic outcomes.

The 2024 and 2025 State of DevOps Reports acknowledged this gap but offered no framework for addressing it. In 2026, with most engineering organizations running at least one AI team, this is not a niche concern. It is a fundamental measurement gap.

4. Platform and Infrastructure Teams Are Poorly Served

A platform team's "customers" are internal developers. Their "deployments" might be library releases, API version bumps, or infrastructure provisioning changes. Deployment frequency for an internal platform team is meaningless because the team might intentionally limit deployment frequency to avoid breaking consumers. Lead time is distorted by internal negotiation cycles that do not exist in product engineering.

What matters for platform teams: internal adoption rate, developer satisfaction with the platform, self-service completion rate, and time-to-productivity for new engineers onboarding onto the platform. None of these map to DORA's four metrics.

5. Strategic Alignment Is Out of Scope

DORA tells you whether engineering is shipping fast and reliably. It does not tell you whether engineering is shipping the right things. A team with Elite DORA metrics working on features that do not align with company strategy is a well-oiled machine pointed in the wrong direction. Strategic alignment, measured as the percentage of engineering effort allocated to the company's top priorities, is arguably more important than delivery speed.
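A rough way to put a number on alignment is to tag effort by initiative and compute the share that lands on top priorities. The initiative names and the person-day unit below are made up for illustration; any consistent effort unit works.

```python
def strategic_alignment(effort_by_initiative: dict[str, float], top_priorities: set[str]) -> float:
    """Share of total engineering effort spent on the company's top priorities."""
    total = sum(effort_by_initiative.values())
    if total <= 0:
        raise ValueError("no effort recorded")
    aligned = sum(
        effort for name, effort in effort_by_initiative.items() if name in top_priorities
    )
    return aligned / total


# Hypothetical quarter, effort in person-days.
effort = {
    "checkout-redesign": 120,
    "ai-search": 75,
    "legacy-reporting": 60,
    "internal-tools": 45,
}
alignment = strategic_alignment(effort, {"checkout-redesign", "ai-search"})
print(f"{alignment:.0%} of effort went to top priorities")
```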

The SPACE Framework

SPACE was published in 2021 by Nicole Forsgren (also behind DORA), Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. It came out of research at GitHub and Microsoft, informed by the observation that developer productivity is multi-dimensional and cannot be captured by any single metric or pair of metrics.

The acronym represents five dimensions:

S: Satisfaction and Wellbeing

How fulfilled and healthy developers feel about their work, team, tools, and culture. Measured through surveys: job satisfaction scores, willingness to recommend the team, burnout indicators. This is the dimension most engineering organizations skip, and it is among the strongest predictors of retention. Teams that score consistently low on satisfaction bleed engineers, and no amount of compensation reliably stops it once people have decided the day-to-day is broken.

P: Performance

The outcomes of work, not the volume. Quality of code delivered, impact on customer satisfaction, reliability of systems, business value produced. This is explicitly not "did they ship a lot" but "did what they shipped matter and work correctly." Measured through outcome metrics: customer satisfaction changes, defect rates, business KPI impact.

A: Activity

Observable actions: commits, code reviews, deployments, documents written, on-call incidents handled. Activity metrics are the easiest to collect and the most dangerous to use in isolation. High activity with low performance means busy work. Low activity with high performance means efficient work. Activity only becomes meaningful when paired with the other four dimensions.

C: Communication and Collaboration

How effectively people and teams work together. Code review turnaround time, cross-team coordination friction, knowledge sharing patterns. In practice, teams with fast code review cycles (under 4 hours for first response) ship noticeably faster than teams where reviews sit for a day or more, because every blocked review stalls the next piece of work behind it. This dimension captures the organizational friction that DORA misses entirely.

E: Efficiency

The ability to complete work with minimal waste and friction. Build times, environment provisioning speed, time spent on toil versus creative work, interruption frequency. A team that spends 35% of its time waiting for builds, fighting configuration issues, and attending status meetings has an efficiency problem that no amount of deployment frequency improvement will fix.

SPACE in Practice

The framework's strength is its completeness. Its weakness is implementation complexity. You cannot instrument SPACE overnight. Satisfaction requires surveys. Performance requires outcome attribution. Communication metrics require collaboration tool instrumentation. Most organizations that attempt full SPACE implementation end up measuring 2-3 dimensions well and the other 2-3 poorly.

My recommendation: start with Satisfaction (quarterly survey, 10 questions, takes 30 days to set up) and Efficiency (build times and toil tracking, instrumentable from existing CI/CD data). These two dimensions surface problems that DORA misses and are actionable without complex attribution models.

The DX (Developer Experience) Framework

DX was formalized by Abi Noda, Margaret-Anne Storey, Nicole Forsgren, and Michaela Greiler in a 2023 paper, "DevEx: What Actually Drives Productivity". Where SPACE provides a broad model with five dimensions, DX deliberately narrows focus to three dimensions that the researchers found most predictive of developer productivity and satisfaction.

Feedback Loops

How quickly developers get information about their code. CI build time is the canonical example: a team with 3-minute builds iterates 5-10x faster than a team with 45-minute builds. But feedback loops extend beyond CI. How fast does a code review come back? How quickly can a developer see their change running in a staging environment? How long does it take to get answers from another team about an API contract?

Rough working benchmarks: the fastest teams I work with keep CI times under 10 minutes, first code review response under 4 hours, and staging deployment under 15 minutes. The slowest sit at CI over 30 minutes, first review response over 24 hours, and staging deployment over 2 hours. The gap between those two profiles is enormous in day-to-day developer experience.
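Those working benchmarks translate directly into a small classifier. This is a sketch using the thresholds quoted above; the metric keys and the "middle" label for values between the two profiles are my own naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    fast_below: float  # values under this match the fastest teams
    slow_above: float  # values over this match the slowest teams


# Thresholds taken from the working benchmarks in the text.
BENCHMARKS = {
    "ci_minutes": Band(fast_below=10, slow_above=30),
    "first_review_hours": Band(fast_below=4, slow_above=24),
    "staging_deploy_minutes": Band(fast_below=15, slow_above=120),
}


def classify(metric: str, value: float) -> str:
    """Place one feedback-loop measurement into fast / middle / slow."""
    band = BENCHMARKS[metric]
    if value < band.fast_below:
        return "fast"
    if value > band.slow_above:
        return "slow"
    return "middle"


# Hypothetical team measurements.
team = {"ci_minutes": 8, "first_review_hours": 30, "staging_deploy_minutes": 45}
report = {metric: classify(metric, value) for metric, value in team.items()}
print(report)
```

A team like the hypothetical one here has a quick pipeline but a slow review loop, which points the fix at review habits rather than CI hardware.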

Cognitive Load

The mental effort required to complete tasks. High cognitive load comes from poorly documented systems, inconsistent tooling, sprawling microservice architectures with unclear ownership, and the constant context-switching of modern development. A developer who needs to consult three wikis, two Slack channels, and a tribal-knowledge holder to understand how to make a change is carrying unnecessary cognitive load.

Measurement: DX uses survey questions ("How easy is it to understand the codebase you work with?", "How often do you need to consult others to complete routine tasks?") scored on a 7-point scale. Teams scoring below 4 on cognitive load questions typically have 2x the cycle time of teams scoring above 5, because every task requires extra investigation before implementation can begin.

Flow State

The ability to enter and sustain focused, productive work. Flow state requires uninterrupted blocks of time (typically 2+ hours), clear task definitions, and minimal administrative overhead. The primary destroyers of flow state in engineering organizations: meetings that fragment the day into sub-2-hour blocks, Slack/Teams notification culture that expects real-time response, and unclear priorities that force developers to context-switch between tasks.

Actionable benchmark: developers who report 4+ hours of uninterrupted coding time per day score 2x higher on self-reported productivity than those with less than 2 hours. The intervention is structural (meeting-free afternoons, async-first communication policies, clear sprint commitments) rather than individual.
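Calendar fragmentation is easy to check before you survey anyone. The sketch below finds free blocks of at least two hours in a workday; the 9:00-17:00 day and both calendars are hypothetical, and times are expressed as decimal hours for brevity.

```python
def focus_blocks(
    day_start: float,
    day_end: float,
    meetings: list[tuple[float, float]],
    min_block: float = 2.0,
) -> list[tuple[float, float]]:
    """Return free (start, end) gaps lasting at least min_block hours."""
    blocks = []
    cursor = day_start
    for start, end in sorted(meetings):
        if start - cursor >= min_block:
            blocks.append((cursor, start))
        cursor = max(cursor, end)  # also handles overlapping meetings
    if day_end - cursor >= min_block:
        blocks.append((cursor, day_end))
    return blocks


# Same 2.5 hours of meetings, two different layouts.
fragmented = [(10.0, 10.5), (12.0, 13.0), (14.5, 15.0), (16.0, 16.5)]
clustered = [(9.0, 9.5), (9.5, 10.5), (11.0, 12.0)]

fragmented_blocks = focus_blocks(9.0, 17.0, fragmented)
clustered_blocks = focus_blocks(9.0, 17.0, clustered)
print(fragmented_blocks)
print(clustered_blocks)
```

Both calendars carry the same meeting load, but only the clustered one leaves a usable focus block, which is why the fix is structural rather than individual.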

DX vs SPACE: When to Use Which

DX is more actionable. Each dimension maps directly to specific interventions: slow feedback loops have an engineering fix (speed up CI, parallelize tests), high cognitive load has a process fix (better documentation, simpler architecture), and broken flow state has an organizational fix (meeting policy, notification culture). SPACE is more comprehensive but requires more organizational maturity to act on all five dimensions simultaneously.

For engineering organizations with fewer than 100 engineers, DX is usually the better starting point. It is simpler to implement, produces actionable insights faster, and the three dimensions cover the friction that most mid-size teams struggle with. For organizations above 200 engineers, SPACE's additional dimensions (especially Communication and Collaboration) become critical because cross-team friction dominates at scale.

Framework Comparison

Here is how the major frameworks stack up across the dimensions that matter to engineering leaders.

| Dimension | DORA | SPACE | DX |
|---|---|---|---|
| Delivery speed | Strong (deployment frequency, lead time) | Partial (Activity dimension) | Indirect (feedback loops speed delivery) |
| Delivery stability | Strong (change failure rate, MTTR) | Partial (Performance dimension) | Not covered |
| Developer satisfaction | Not covered | Strong (Satisfaction dimension) | Indirect (flow state correlates with satisfaction) |
| Developer productivity friction | Not covered | Strong (Efficiency dimension) | Strong (cognitive load, feedback loops) |
| Business impact | Not covered | Partial (Performance dimension) | Not covered |
| Team collaboration | Not covered | Strong (Communication dimension) | Not covered |
| AI/ML team fit | Poor (assumes deterministic workflows) | Moderate (multi-dimensional helps) | Moderate (cognitive load applies to AI work) |
| Ease of implementation | High (4 metrics, instrumentable from CI/CD) | Low (5 dimensions, surveys + instrumentation) | Medium (3 dimensions, survey-driven) |
| Industry benchmarks available | Yes (annual State of DevOps Report) | Limited (no standard benchmark set) | Yes (via DX platform, since 2024) |
| Tooling ecosystem | Strong (LinearB, Sleuth, Jellyfish, Faros AI) | Weak (no dedicated tooling) | Moderate (DX platform, Pluralsight Flow) |

Emerging Metrics for the AI Era

None of the existing frameworks were designed for engineering organizations where 20-40% of the team is building AI systems. The gap is not just about AI team metrics (covered in our AI team metrics guide). It is about how AI tooling is changing the work patterns of every engineer, including those on traditional platform and product teams.

AI-Assisted Development Metrics

By mid-2026, most engineering teams use AI coding assistants (GitHub Copilot, Cursor, Cline, Claude Code, Windsurf). This creates new measurement needs:

  • AI suggestion acceptance rate: What percentage of AI-generated code suggestions does the team accept? A rate below 15% suggests the AI tool is poorly configured for your codebase. Above 50% suggests over-reliance that may introduce subtle bugs. The sweet spot for most teams is 25-40%.
  • AI-assisted vs manual defect rate: Do code sections written with AI assistance have higher or lower defect rates than manually-written code? The pattern most teams report is that AI-assisted code holds up fine when review stays rigorous, and degrades when reviewers start trusting plausible-looking output without scrutiny.
  • Time saved per task category: Where does AI assistance actually save time? Boilerplate generation and test writing show the largest gains (30-50% time reduction). Complex architectural decisions and debugging show minimal gains and sometimes negative gains when the AI suggestion sends the developer down the wrong path.
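The acceptance-rate bands above can be encoded as a simple check. This is a sketch: the label names are mine, and the counts would come from whatever telemetry your assistant exposes, which I am not assuming any particular API for.

```python
def interpret_acceptance_rate(accepted: int, shown: int) -> tuple[float, str]:
    """Map an AI suggestion acceptance rate onto the bands described in the text."""
    if shown <= 0:
        raise ValueError("no suggestions shown")
    rate = accepted / shown
    if rate < 0.15:
        label = "too_low"      # tool may be poorly configured for the codebase
    elif rate > 0.50:
        label = "too_high"     # possible over-reliance
    elif 0.25 <= rate <= 0.40:
        label = "sweet_spot"
    else:
        label = "borderline"   # between a warning band and the sweet spot
    return rate, label


# Hypothetical month: 320 accepted out of 1,000 suggestions shown.
rate, label = interpret_acceptance_rate(320, 1000)
print(f"{rate:.0%} -> {label}")
```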

AI Team-Specific Metrics

For teams building AI features (not just using AI tools), the measurement framework shifts fundamentally:

  • Experiment velocity: Experiments run per sprint, conversion to production. Target 8-15 experiments per sprint for a 5-person team, 20-30% conversion rate.
  • Model quality envelope: Multi-dimensional quality tracking (accuracy, latency, cost, fairness) with defined acceptable ranges rather than single-metric optimization.
  • Inference cost trajectory: Cost per request over time. Should be declining quarter-over-quarter as the team optimizes prompts, caching, model selection, and batching.
  • Eval suite maturity: Coverage and pass rate of automated evaluation suites. The AI equivalent of test coverage. Target: eval coverage on 100% of production prompts and models, pass rate stable or improving.
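Two of these metrics are simple enough to sketch in a few lines: the experiment conversion rate and a quality envelope check that replaces binary pass/fail. The envelope bounds, metric names, and candidate numbers below are illustrative assumptions, not recommended values.

```python
# Hypothetical acceptable ranges (min, max) per quality dimension.
ENVELOPE = {
    "accuracy_en": (0.85, 1.00),
    "accuracy_multilingual": (0.80, 1.00),
    "latency_p95_ms": (0, 800),
    "cost_per_request_usd": (0.0, 0.02),
}


def envelope_violations(measured: dict[str, float],
                        envelope: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return the dimensions where a model version falls outside its acceptable range."""
    return {
        name: value
        for name, value in measured.items()
        if not envelope[name][0] <= value <= envelope[name][1]
    }


def conversion_rate(experiments_run: int, shipped_to_production: int) -> float:
    """Share of experiments in a sprint that reached production."""
    if experiments_run <= 0:
        raise ValueError("no experiments run")
    return shipped_to_production / experiments_run


# A candidate that improves English accuracy but slips on multilingual inputs.
candidate = {
    "accuracy_en": 0.91,
    "accuracy_multilingual": 0.79,
    "latency_p95_ms": 640,
    "cost_per_request_usd": 0.012,
}
violations = envelope_violations(candidate, ENVELOPE)
sprint_conversion = conversion_rate(experiments_run=12, shipped_to_production=3)
print(violations, sprint_conversion)
```

The envelope turns "is this release a failure?" into "which dimension left its agreed range?", which is a question a probabilistic system can actually answer.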

Cross-Cutting Metrics for Hybrid Organizations

Most engineering organizations in 2026 run a mix of traditional software teams and AI teams. You need metrics that work across both:

  • Time-to-value: Days from idea to measurable business impact. Works for both a traditional feature (idea to revenue-attributed deploy) and an AI feature (experiment to production with measured user impact). The common denominator is business impact, which removes the apples-to-oranges problem of comparing deployment counts.
  • Cost per business outcome: Total team cost divided by business outcomes delivered. For product teams, outcomes might be features with measured user adoption. For AI teams, outcomes might be automation cost savings or AI feature revenue. For platform teams, outcomes might be developer hours saved through tooling improvements.
  • Developer experience score: Quarterly survey covering the DX dimensions (feedback loops, cognitive load, flow state) that applies equally to all team types. A platform engineer frustrated by 45-minute build times and an AI engineer frustrated by 3-hour training runs are both experiencing broken feedback loops.
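The first two cross-cutting metrics reduce to a date difference and a division. A minimal sketch, with hypothetical dates, costs, and outcome counts; what counts as an "outcome" is the definition each team type has to agree on first.

```python
from datetime import date


def time_to_value_days(idea_date: date, impact_date: date) -> int:
    """Days from idea to the first measured business impact."""
    return (impact_date - idea_date).days


def cost_per_outcome(total_team_cost: float, outcomes_delivered: int) -> float:
    """Total team cost divided by business outcomes delivered in the same period."""
    if outcomes_delivered <= 0:
        raise ValueError("no outcomes delivered in this period")
    return total_team_cost / outcomes_delivered


# Hypothetical feature: idea logged 12 Jan 2026, impact measured 2 Mar 2026.
ttv = time_to_value_days(date(2026, 1, 12), date(2026, 3, 2))
# Hypothetical quarter: 450,000 team cost, 6 outcomes with measured adoption.
cpo = cost_per_outcome(450_000, 6)
print(ttv, cpo)
```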

Building Your Measurement Stack

Do not adopt a framework wholesale. Build a measurement stack that answers your specific questions by combining elements from multiple frameworks. Here is the process I use with CTOs:

Step 1: Identify Your Questions (Week 1)

Every measurement system should answer specific questions. Write them down. Common examples:

  • Are we shipping what we committed to? (Delivery predictability)
  • Is our engineering investment efficient? (Cost per outcome)
  • Are our developers productive and happy? (DX + satisfaction)
  • Is production stable? (Reliability)
  • Are our AI investments paying off? (AI-specific business impact)

Limit yourself to 4-6 questions. Each question maps to 1-2 metrics. More than 12 total metrics means nobody will track any of them consistently.
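One way to keep yourself honest about these limits is to write the question-to-metric mapping down as data and validate it. The metric names below are placeholders I chose for the example questions; the limits mirror the ones stated above.

```python
# Hypothetical mapping from leadership questions to the metrics that answer them.
MEASUREMENT_STACK = {
    "Are we shipping what we committed to?": ["delivery_predictability"],
    "Is our engineering investment efficient?": ["cost_per_outcome"],
    "Are our developers productive and happy?": ["dx_survey_score", "job_satisfaction"],
    "Is production stable?": ["change_failure_rate", "time_to_restore"],
    "Are our AI investments paying off?": ["ai_feature_adoption"],
}


def validate_stack(
    stack: dict[str, list[str]],
    min_questions: int = 4,
    max_questions: int = 6,
    max_metrics_per_question: int = 2,
    max_total_metrics: int = 12,
) -> list[str]:
    """Return a list of problems; an empty list means the stack fits the limits."""
    problems = []
    if not min_questions <= len(stack) <= max_questions:
        problems.append(f"{len(stack)} questions; aim for {min_questions}-{max_questions}")
    for question, metrics in stack.items():
        if not 1 <= len(metrics) <= max_metrics_per_question:
            problems.append(f"'{question}' maps to {len(metrics)} metrics")
    total = sum(len(metrics) for metrics in stack.values())
    if total > max_total_metrics:
        problems.append(f"{total} metrics in total; keep it at {max_total_metrics} or fewer")
    return problems


problems = validate_stack(MEASUREMENT_STACK)
print(problems)
```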

Step 2: Instrument What You Have (Weeks 2-4)

Most organizations already collect 60-70% of the data they need but do not surface it. Check these sources:

  • CI/CD pipeline: Deployment frequency, lead time, build times, test pass rates. Tools: GitHub Actions, GitLab CI, Jenkins, CircleCI all expose these natively.
  • Incident management: Incident count, MTTR, severity distribution. Tools: PagerDuty, Opsgenie, incident.io, Rootly.
  • Project tracking: Cycle time, throughput, scope changes. Tools: Jira, Linear, Shortcut all have analytics views.
  • Observability: Uptime, latency percentiles, error rates. Tools: Datadog, Grafana, New Relic.

Step 3: Add Survey Data (Month 2)

The dimensions that matter most (satisfaction, cognitive load, flow state) require survey data. Run a 15-question quarterly survey covering:

  • Overall job satisfaction (1-10)
  • Tooling satisfaction (1-10)
  • Cognitive load: "How easy is it to understand the systems you work with?" (1-7)
  • Flow state: "How many hours of uninterrupted coding time do you get per day?" (numeric)
  • Collaboration: "How quickly do code reviews come back?" (hours)
  • One open-ended: "What is the biggest source of friction in your daily work?"

Response rate target: 70%+. Below that, selection bias makes the data unreliable. Anonymize responses to get honest input. Use a dedicated survey tool (Culture Amp, Officevibe, DX platform), not a Google Form buried in Slack.
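Whatever survey tool you use, the first thing to compute from an export is the response rate, because it decides whether the averages mean anything. A sketch with invented answers for a 10-person team; the question keys are placeholders.

```python
from statistics import mean


def summarize_survey(responses: list[dict[str, float]], team_size: int,
                     min_response_rate: float = 0.70) -> dict[str, object]:
    """Average each numeric question and flag results below the response-rate target."""
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    response_rate = len(responses) / team_size
    summary: dict[str, object] = {
        "response_rate": response_rate,
        "reliable": response_rate >= min_response_rate,
    }
    questions = list(responses[0].keys()) if responses else []
    for question in questions:
        summary[f"avg_{question}"] = mean([r[question] for r in responses])
    return summary


# Invented anonymized answers from 8 of 10 engineers.
satisfaction = [7, 8, 6, 7, 9, 5, 8, 6]      # 1-10 scale
ease = [4, 5, 3, 4, 6, 3, 5, 2]              # 1-7 scale
focus = [2, 3, 1.5, 2, 4, 1, 3, 1.5]         # hours per day
responses = [
    {"job_satisfaction": s, "ease_of_understanding": e, "focus_hours": f}
    for s, e, f in zip(satisfaction, ease, focus)
]
summary = summarize_survey(responses, team_size=10)
print(summary)
```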

Step 4: Set Baselines, Then Targets (Month 3)

Do not set targets before you have baseline data. Your first 8-12 weeks are measurement-only. After that, set targets collaboratively with teams based on their own baseline. "Your cycle time baseline is 24 days. Can we get it to 18 by end of Q3?" is actionable. "Industry benchmark is 14 days, get there by next quarter" is demoralizing and ignores your specific context.
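The baseline-first rule can be enforced in code: refuse to propose a target until enough weeks of data exist. In this sketch the 8-week minimum, the use of the median as baseline, and the 25% improvement step (which matches the 24-to-18-day example) are assumptions; the output is a starting point for the conversation with the team, not a mandate.

```python
from statistics import median


def propose_target(weekly_values: list[float], improvement: float = 0.25,
                   min_weeks: int = 8) -> tuple[float, float]:
    """Return (baseline, proposed_target) once enough baseline data exists."""
    if len(weekly_values) < min_weeks:
        raise ValueError(
            f"only {len(weekly_values)} weeks of data; collect at least {min_weeks}"
        )
    baseline = median(weekly_values)
    return baseline, baseline * (1 - improvement)


# Hypothetical weekly cycle time in days over 8 weeks.
cycle_time_days = [26, 22, 25, 23, 24, 27, 21, 24]
baseline, target = propose_target(cycle_time_days)
print(f"Baseline {baseline} days, proposed target {target} days")
```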

Step 5: Review Cadence

Three rhythms, three audiences:

| Cadence | Audience | Metrics | Format |
|---|---|---|---|
| Weekly | Engineering managers + teams | Cycle time, throughput, build health, incident status | Shared dashboard, reviewed in team retros |
| Monthly | VP Engineering / CTO | Delivery predictability, DX survey trends, cost per outcome, AI metrics | 30-minute review meeting with commentary |
| Quarterly | Board / C-suite | Commitment reliability, uptime, eng cost / revenue, AI business impact | One-page slide with trends and forward-looking indicators |

Common Implementation Mistakes

Having helped engineering organizations adopt metrics frameworks over the past several years, I see these failure patterns repeatedly:

Measuring individuals instead of teams

The moment you rank individual engineers by commit count, review speed, or lines of code, you create a culture of gaming and competition that destroys collaboration. An engineer who spends a day helping three teammates unblock is more valuable than one who writes 500 lines of code in isolation, but individual metrics cannot capture that. Measure teams. Review individuals through 1-on-1s and peer feedback, not dashboards.

Adopting too many metrics at once

Engineering organizations that try to implement full SPACE (5 dimensions, 15+ metrics) in one quarter end up measuring nothing well. Start with 3-4 metrics you can instrument reliably. Add dimensions as you build the habit of reviewing and acting on data. A team that religiously tracks cycle time, developer satisfaction, and incident count will outperform a team that half-tracks 20 metrics.

Setting targets before baselines

"Our deployment frequency should be daily." Based on what? If your current deployment frequency is weekly and your architecture requires a 4-hour integration test suite, daily deploys require infrastructure investment, not just a target. Always collect 8-12 weeks of baseline data before setting targets. Targets without baselines are aspirations, not management.

Using metrics as a stick

When metrics become a performance management tool rather than a diagnostic tool, teams game them. I watched an organization where deployment frequency became a team KPI. Teams started splitting changes into tiny, trivial deployments to hit the target. Deployment frequency tripled. Value delivery did not change. Use metrics to identify friction and improvement opportunities, not to rank or punish teams.

Ignoring the qualitative

Numbers without narratives mislead. Cycle time increased from 14 to 22 days last quarter. Bad? Maybe the team took on a complex migration that was strategically necessary and inherently slower. Numbers tell you what changed. Conversations with team leads tell you why. Always pair quantitative dashboards with qualitative commentary.

DORA Alternatives: Frequently Asked Questions

Are DORA metrics still relevant in 2026?
Yes, but insufficient on their own. DORA still provides the best standardized benchmark for software delivery performance: deployment frequency, lead time for changes, change failure rate, and mean time to restore. These remain valid for deterministic software systems. The problem is scope, not accuracy. DORA measures one dimension (delivery throughput and stability) when engineering leaders need visibility across developer experience, business impact, strategic alignment, and AI/ML-specific workflows. Use DORA as your delivery baseline and supplement with SPACE or DX for the dimensions DORA cannot see.
What is the SPACE framework?
SPACE is a developer productivity framework created by researchers at GitHub, Microsoft, and the University of Victoria, published in 2021. It measures five dimensions: Satisfaction and wellbeing, Performance (outcome quality, not output volume), Activity (observable actions like commits, reviews, deployments), Communication and collaboration (how effectively teams work together), and Efficiency (ability to complete work with minimal friction). The key insight is that no single dimension captures productivity. A developer can be highly active (many commits) but inefficient (fighting bad tooling) and unsatisfied (burning out). SPACE forces multi-dimensional measurement.
What is the DX framework and how does it differ from SPACE?
DX (Developer Experience) is a framework by Abi Noda, Margaret-Anne Storey, Nicole Forsgren, and Michaela Greiler (2023) that focuses on three core dimensions: feedback loops (how quickly developers get information about their code), cognitive load (the mental effort required to complete tasks), and flow state (the ability to work without interruption). DX differs from SPACE by being more actionable: each dimension maps directly to specific interventions. Long feedback loops? Speed up CI. High cognitive load? Simplify architecture. Broken flow state? Fix the meeting culture. DX is survey-driven, typically measured quarterly, with benchmarks available through the DX platform.
How do you measure engineering productivity without gaming?
Three principles reduce gaming risk. First, measure outcomes at the team level rather than individual activity. Teams cannot game cycle time as easily as individuals can game lines of code. Second, use composite metrics that balance competing incentives: deployment frequency paired with change failure rate prevents teams from shipping fast and breaking things. Third, include qualitative dimensions (developer satisfaction surveys, peer assessments) that resist quantitative gaming. The strongest signal against gaming is measuring business impact: revenue from features, cost savings from automation, customer satisfaction changes. These cannot be faked with process tricks.
What metrics should replace DORA for AI and ML teams?
AI teams need three metric categories that DORA does not cover. Experiment velocity: experiments run per sprint, experiment-to-production conversion rate (target 20-30%), and time from first experiment to production deployment. Model quality: hallucination rate, inference latency p95, model drift detection, and eval suite regression rate. Business value: AI feature adoption among eligible users, cost savings from AI automation measured against a control group, and inference cost per transaction trending down quarter-over-quarter. Keep using DORA for the deterministic parts of your AI stack (APIs, data pipelines, infrastructure) and use these AI-specific metrics for the probabilistic parts.
How do you implement a new metrics framework without disrupting teams?
Phase it over 90 days. Month one: instrument what you already have. Most teams are already collecting deployment data, incident data, and CI metrics but not surfacing them consistently. Build a shared dashboard with existing data. Month two: add the missing dimensions. For SPACE, add a developer satisfaction survey. For DX, run the first feedback loops and cognitive load assessment. Do not set targets yet. Collect baselines. Month three: set targets collaboratively with teams based on their own baseline data, not industry benchmarks. Teams that participate in setting their own targets are 3x more likely to improve on them than teams who receive targets from management.
Thomas Prommer Technology Executive — CTO/CIO/CTAIO
