What DORA Gets Right
Before critiquing DORA, it is worth acknowledging what it achieved. Before the State of DevOps Reports and the 2018 book Accelerate, which distilled that research, engineering organizations had no standardized vocabulary for delivery performance. Every company measured different things, so benchmarking was impossible. When a CTO said "we ship fast," there was no way to verify or compare the claim.
DORA fixed that. The four metrics are well-defined, measurable with existing tooling, and validated against organizational performance through years of survey research. The classification system (Elite, High, Medium, Low) gives teams a benchmark that is directionally useful even if the exact cutoffs are debatable.
Deployment frequency measures throughput: how often does the team get changes into production? Elite teams deploy on demand, multiple times per day. Low performers deploy between once per month and once every six months. The gap is typically explained by CI/CD maturity, test automation, and organizational trust.
Lead time for changes measures velocity: how long from code committed to code running in production? Elite teams keep this under one day. The metric captures pipeline friction, review bottlenecks, and deployment ceremony, which makes it genuinely useful for identifying process problems.
Change failure rate measures quality: what percentage of changes to production result in degraded service requiring remediation? Elite teams keep this below 5%. This is the balancing metric that prevents teams from optimizing deployment frequency at the expense of stability.
Mean time to restore (MTTR) measures resilience: when something breaks, how quickly does the team fix it? Elite teams restore within one hour. This metric rewards investment in monitoring, on-call processes, and incident management.
Together, these four metrics form a balanced scorecard for software delivery. Speed without quality is reckless. Quality without speed is stagnation. DORA captures both sides of that tension in four numbers that fit on one slide. That is genuinely valuable, and abandoning it entirely would be a mistake.
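To make the four definitions concrete, here is a minimal sketch of how they could be computed from a list of deployment records. The `Deployment` record, its field names, and `dora_summary` are hypothetical, not part of any DORA tooling. The sketch also assumes that a failure starts at deploy time and that lead time is summarized with a median, both of which your own definitions may change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # change running in production
    failed: bool = False                    # degraded service, needed remediation
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_summary(deploys: list[Deployment], window_days: int) -> dict[str, Optional[float]]:
    """Summarize the four DORA metrics over one observation window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deploys if d.failed]
    # Assumption: the outage begins when the failing change is deployed.
    restores = [hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at]
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deploys]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        # None when nothing failed, or nothing has been restored yet, in the window.
        "mean_restore_hours": mean(restores) if restores else None,
    }


t0 = datetime(2026, 1, 5, 9, 0)
deploys = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=6), failed=True,
               restored_at=t0 + timedelta(hours=6, minutes=30)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=10)),
]
summary = dora_summary(deploys, window_days=7)
```

In practice the records would come from your CI/CD and incident tooling rather than being typed in by hand. The point is only that all four numbers fall out of the same small data set.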
Where DORA Falls Short
DORA's limitations are not bugs in the framework. They are scope boundaries. DORA measures software delivery performance. It does not measure engineering effectiveness, developer experience, business impact, or strategic alignment. The problem is not DORA. The problem is treating DORA as a complete measurement system when it was always designed to be one input among several.
1. Developer Experience Is Invisible
A team can have Elite DORA metrics and miserable developers. Fast deployment pipelines do not mean developers enjoy their work. Lead time can look great while engineers spend 40% of their time fighting flaky tests, waiting for environments to spin up, or wrestling with a codebase that nobody documented. DORA measures the pipeline. It does not measure the human beings feeding the pipeline. Developer attrition, satisfaction, and cognitive load sit entirely outside the DORA frame.
This matters financially. Replacing a senior engineer costs 6-9 months of salary in recruiting, onboarding, and ramp-up productivity loss. A team with Elite DORA metrics and 30% annual attrition is not high-performing. It is burning engineers to maintain the metrics, which is unsustainable.
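The arithmetic is worth running once for your own team. The sketch below applies the 6-9 month rule to a hypothetical 20-person team. The team size and the $180,000 average salary are illustrative assumptions, and `annual_attrition_cost` is a made-up helper name.

```python
def annual_attrition_cost(team_size: int, attrition_rate: float, avg_salary: float,
                          months_low: float = 6, months_high: float = 9) -> tuple[float, float]:
    """Return the (low, high) yearly replacement cost implied by the 6-9 month rule."""
    departures = team_size * attrition_rate
    monthly_salary = avg_salary / 12
    return (departures * monthly_salary * months_low,
            departures * monthly_salary * months_high)


# Hypothetical inputs: 20 engineers, 30% annual attrition, $180k average salary.
low, high = annual_attrition_cost(team_size=20, attrition_rate=0.30, avg_salary=180_000)
```

With those assumed inputs, six departures a year work out to roughly $540,000 to $810,000 in replacement cost, before counting the delivery slowdown while seats sit empty.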
2. Business Impact Is Assumed, Not Measured
DORA's research shows that high-performing teams correlate with better organizational outcomes. But correlation at the population level does not mean your specific team's deployment frequency is causing better business results. A team deploying 50 times per day on a feature nobody uses is not creating value. A team deploying twice a week on the revenue-critical checkout flow might be creating enormous value.
The missing link is impact attribution. Which deployments moved business metrics? Which features justified their engineering cost? DORA measures throughput and stability but not whether the throughput produced anything the business needed. For boards and executives, this gap makes DORA data interesting but not actionable.
3. AI and ML Workflows Break the Model
DORA's four metrics assume a specific workflow: code is written, reviewed, merged, deployed, and either works or fails. AI engineering does not follow this pattern. A model fine-tuning run takes weeks and produces no "deployments" until evaluation completes. A prompt engineering cycle might produce 30 "deployments" (prompt versions) in a day, each of which is trivial in isolation but part of a larger optimization effort.
Change failure rate breaks entirely on probabilistic systems. A new model version is never binary pass/fail. It might improve accuracy on English inputs by 4% while degrading multilingual performance by 1.5%. Is that a failure? DORA cannot answer because DORA assumes deterministic outcomes.
The 2024 and 2025 State of DevOps Reports acknowledged this gap but offered no framework for addressing it. In 2026, with most engineering organizations running at least one AI team, this is not a niche concern. It is a fundamental measurement gap.
4. Platform and Infrastructure Teams Are Poorly Served
A platform team's "customers" are internal developers. Their "deployments" might be library releases, API version bumps, or infrastructure provisioning changes. Deployment frequency for an internal platform team is meaningless because the team might intentionally limit deployment frequency to avoid breaking consumers. Lead time is distorted by internal negotiation cycles that do not exist in product engineering.
What matters for platform teams: internal adoption rate, developer satisfaction with the platform, self-service completion rate, and time-to-productivity for new engineers onboarding onto the platform. None of these map to DORA's four metrics.
5. Strategic Alignment Is Out of Scope
DORA tells you whether engineering is shipping fast and reliably. It does not tell you whether engineering is shipping the right things. A team with Elite DORA metrics working on features that do not align with company strategy is a well-oiled machine pointed in the wrong direction. Strategic alignment, measured as the percentage of engineering effort allocated to the company's top priorities, is arguably more important than delivery speed.
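The alignment percentage described above is simple to compute once effort is tagged by initiative. This is a rough sketch: the initiative names, the hour figures, and the `strategic_alignment` helper are all hypothetical, and it assumes effort is tracked in a single unit such as engineer-hours.

```python
def strategic_alignment(effort_by_initiative: dict[str, float],
                        top_priorities: set[str]) -> float:
    """Share of total engineering effort spent on the company's top priorities."""
    total = sum(effort_by_initiative.values())
    if total <= 0:
        raise ValueError("no effort recorded")
    aligned = sum(effort for name, effort in effort_by_initiative.items()
                  if name in top_priorities)
    return aligned / total


# Hypothetical quarter, measured in engineer-hours.
effort = {"checkout-redesign": 120, "platform-migration": 60, "internal-tools": 20}
alignment = strategic_alignment(effort, {"checkout-redesign", "platform-migration"})
```

The hard part is not the division but the tagging. The number is only as good as the mapping from tickets or epics to strategic priorities.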
The SPACE Framework
SPACE was published in 2021 by Nicole Forsgren (also behind DORA), Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. It came out of research at GitHub and Microsoft, informed by the observation that developer productivity is multi-dimensional and cannot be captured by any single metric or pair of metrics.
The acronym represents five dimensions:
Satisfaction and Wellbeing
How fulfilled and healthy developers feel about their work, team, tools, and culture. Measured through surveys: job satisfaction scores, willingness to recommend the team, burnout indicators. This is the dimension most engineering organizations skip, and it is among the strongest predictors of retention. Teams that score consistently low on satisfaction bleed engineers, and no amount of compensation reliably stops it once people have decided the day-to-day is broken.
Performance
The outcomes of work, not the volume. Quality of code delivered, impact on customer satisfaction, reliability of systems, business value produced. This is explicitly not "did they ship a lot" but "did what they shipped matter and work correctly." Measured through outcome metrics: customer satisfaction changes, defect rates, business KPI impact.
Activity
Observable actions: commits, code reviews, deployments, documents written, on-call incidents handled. Activity metrics are the easiest to collect and the most dangerous to use in isolation. High activity with low performance means busy work. Low activity with high performance means efficient work. Activity only becomes meaningful when paired with the other four dimensions.
Communication and Collaboration
How effectively people and teams work together. Code review turnaround time, cross-team coordination friction, knowledge sharing patterns. In practice, teams with fast code review cycles (under 4 hours for first response) ship noticeably faster than teams where reviews sit for a day or more, because every blocked review stalls the next piece of work behind it. This dimension captures the organizational friction that DORA misses entirely.
Efficiency and Flow
The ability to complete work with minimal waste and friction. Build times, environment provisioning speed, time spent on toil versus creative work, interruption frequency. A team that spends 35% of its time waiting for builds, fighting configuration issues, and attending status meetings has an efficiency problem that no amount of deployment frequency improvement will fix.
SPACE in Practice
The framework's strength is its completeness. Its weakness is implementation complexity. You cannot instrument SPACE overnight. Satisfaction requires surveys. Performance requires outcome attribution. Communication metrics require collaboration tool instrumentation. Most organizations that attempt full SPACE implementation end up measuring 2-3 dimensions well and the other 2-3 poorly.
My recommendation: start with Satisfaction (quarterly survey, 10 questions, takes 30 days to set up) and Efficiency (build times and toil tracking, instrumentable from existing CI/CD data). These two dimensions surface problems that DORA misses and are actionable without complex attribution models.
The DX (Developer Experience) Framework
DX was formalized by Abi Noda, Margaret-Anne Storey, Nicole Forsgren, and Michaela Greiler in a 2023 paper, "DevEx: What Actually Drives Productivity." Where SPACE provides a broad model with five dimensions, DX deliberately narrows focus to three dimensions that the researchers found most predictive of developer productivity and satisfaction.
Feedback Loops
How quickly developers get information about their code. CI build time is the canonical example: a team with 3-minute builds iterates 5-10x faster than a team with 45-minute builds. But feedback loops extend beyond CI. How fast does a code review come back? How quickly can a developer see their change running in a staging environment? How long does it take to get answers from another team about an API contract?
Rough working benchmarks: the fastest teams I work with keep CI times under 10 minutes, first code review response under 4 hours, and staging deployment under 15 minutes. The slowest sit at CI over 30 minutes, first review response over 24 hours, and staging deployment over 2 hours. The gap between those two profiles is enormous in day-to-day developer experience.
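Those working benchmarks can be turned into a quick self-check. The sketch below is illustrative only: the dictionary keys, the "middle" label for values between the two profiles, and the `classify_feedback_loops` helper are my own naming, and the thresholds are the rough figures above rather than an industry standard.

```python
# Thresholds taken from the rough working benchmarks above.
FAST = {"ci_minutes": 10, "review_hours": 4, "staging_minutes": 15}
SLOW = {"ci_minutes": 30, "review_hours": 24, "staging_minutes": 120}


def classify_feedback_loops(measured: dict[str, float]) -> dict[str, str]:
    """Label each feedback loop as fast, slow, or somewhere in the middle."""
    labels = {}
    for loop, value in measured.items():
        if value < FAST[loop]:
            labels[loop] = "fast"
        elif value > SLOW[loop]:
            labels[loop] = "slow"
        else:
            labels[loop] = "middle"
    return labels


# Hypothetical team: quick CI, slow reviews, middling staging deploys.
labels = classify_feedback_loops(
    {"ci_minutes": 8, "review_hours": 30, "staging_minutes": 45}
)
```

A mixed result like this one is common and useful, because it points at the single loop (here, review turnaround) where an intervention would pay off first.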
Cognitive Load
The mental effort required to complete tasks. High cognitive load comes from poorly documented systems, inconsistent tooling, sprawling microservice architectures with unclear ownership, and the constant context-switching of modern development. A developer who needs to consult three wikis, two Slack channels, and a tribal-knowledge holder to understand how to make a change is carrying unnecessary cognitive load.
Measurement: DX uses survey questions ("How easy is it to understand the codebase you work with?", "How often do you need to consult others to complete routine tasks?") scored on a 7-point scale, with higher scores meaning lower load. Teams scoring below 4 on cognitive load questions typically have 2x the cycle time of teams scoring above 5, because every task requires extra investigation before implementation can begin.
Flow State
The ability to enter and sustain focused, productive work. Flow state requires uninterrupted blocks of time (typically 2+ hours), clear task definitions, and minimal administrative overhead. The primary destroyers of flow state in engineering organizations: meetings that fragment the day into sub-2-hour blocks, Slack/Teams notification culture that expects real-time response, and unclear priorities that force developers to context-switch between tasks.
Actionable benchmark: developers who report 4+ hours of uninterrupted coding time per day score 2x higher on self-reported productivity than those with less than 2 hours. The intervention is structural (meeting-free afternoons, async-first communication policies, clear sprint commitments) rather than individual.
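Self-reported focus time can be cross-checked against calendars. The sketch below finds the uninterrupted blocks of at least two hours in one workday. The `focus_blocks` helper and the sample schedule are hypothetical, and it assumes meetings are the only interruption, which understates the damage done by chat notifications.

```python
from datetime import datetime, timedelta

Interval = tuple[datetime, datetime]


def focus_blocks(day_start: datetime, day_end: datetime, meetings: list[Interval],
                 min_block: timedelta = timedelta(hours=2)) -> list[Interval]:
    """Return the gaps between meetings that are at least min_block long."""
    blocks = []
    cursor = day_start
    for start, end in sorted(meetings):
        start = min(start, day_end)      # ignore anything after the workday
        if start - cursor >= min_block:
            blocks.append((cursor, start))
        cursor = max(cursor, end)        # also handles overlapping meetings
    if day_end - cursor >= min_block:
        blocks.append((cursor, day_end))
    return blocks


def at(hour: int, minute: int = 0) -> datetime:
    return datetime(2026, 3, 2, hour, minute)


# Hypothetical day: three short meetings spread across 9:00-17:00.
meetings = [(at(10), at(10, 30)), (at(13), at(14)), (at(15, 30), at(16))]
blocks = focus_blocks(at(9), at(17), meetings)
focus_hours = sum((end - start).total_seconds() for start, end in blocks) / 3600
```

In this example only two hours of the day are spent in meetings, yet just one block (10:30 to 13:00) qualifies as focus time. That is the fragmentation effect the structural interventions are meant to fix.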
DX vs SPACE: When to Use Which
DX is more actionable. Each dimension maps directly to specific interventions: slow feedback loops have an engineering fix (speed up CI, parallelize tests), high cognitive load has a process fix (better documentation, simpler architecture), and broken flow state has an organizational fix (meeting policy, notification culture). SPACE is more comprehensive but requires more organizational maturity to act on all five dimensions simultaneously.
For engineering organizations with fewer than 100 engineers, DX is usually the better starting point. It is simpler to implement, produces actionable insights faster, and the three dimensions cover the friction that most mid-size teams struggle with. For organizations above 200 engineers, SPACE's additional dimensions (especially Communication and Collaboration) become critical because cross-team friction dominates at scale.
Framework Comparison
Here is how the major frameworks stack up across the dimensions that matter to engineering leaders.
| Dimension | DORA | SPACE | DX |
|---|---|---|---|
| Delivery speed | Strong (deployment frequency, lead time) | Partial (Activity dimension) | Indirect (feedback loops speed delivery) |
| Delivery stability | Strong (change failure rate, MTTR) | Partial (Performance dimension) | Not covered |
| Developer satisfaction | Not covered | Strong (Satisfaction dimension) | Indirect (flow state correlates with satisfaction) |
| Developer productivity friction | Not covered | Strong (Efficiency dimension) | Strong (cognitive load, feedback loops) |
| Business impact | Not covered | Partial (Performance dimension) | Not covered |
| Team collaboration | Not covered | Strong (Communication dimension) | Not covered |
| AI/ML team fit | Poor (assumes deterministic workflows) | Moderate (multi-dimensional helps) | Moderate (cognitive load applies to AI work) |
| Ease of implementation | High (4 metrics, instrumentable from CI/CD) | Low (5 dimensions, surveys + instrumentation) | Medium (3 dimensions, survey-driven) |
| Industry benchmarks available | Yes (annual State of DevOps Report) | Limited (no standard benchmark set) | Yes (via DX platform, since 2024) |
| Tooling ecosystem | Strong (LinearB, Sleuth, Jellyfish, Faros AI) | Weak (no dedicated tooling) | Moderate (DX platform, Pluralsight Flow) |
Emerging Metrics for the AI Era
None of the existing frameworks were designed for engineering organizations where 20-40% of the team is building AI systems. The gap is not just about AI team metrics (covered in our AI team metrics guide). It is about how AI tooling is changing the work patterns of every engineer, including those on traditional platform and product teams.
AI-Assisted Development Metrics
By mid-2026, most engineering teams use AI coding assistants (GitHub Copilot, Cursor, Cline, Claude Code, Windsurf). This creates new measurement needs:
- AI suggestion acceptance rate: What percentage of AI-generated code suggestions does the team accept? A rate below 15% suggests the AI tool is poorly configured for your codebase. Above 50% suggests over-reliance that may introduce subtle bugs. The sweet spot for most teams is 25-40%.
- AI-assisted vs manual defect rate: Do code sections written with AI assistance have higher or lower defect rates than manually-written code? The pattern most teams report is that AI-assisted code holds up fine when review stays rigorous, and degrades when reviewers start trusting plausible-looking output without scrutiny.
- Time saved per task category: Where does AI assistance actually save time? Boilerplate generation and test writing show the largest gains (30-50% time reduction). Complex architectural decisions and debugging show minimal gains and sometimes negative gains when the AI suggestion sends the developer down the wrong path.
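The acceptance-rate bands above are easy to encode as a first-pass check. This is a sketch under stated assumptions: the band labels and the `acceptance_band` helper are hypothetical, and "borderline" is my own name for the ranges (15-25% and 40-50%) that the guidance above does not label.

```python
def acceptance_band(accepted: int, shown: int) -> str:
    """Place an AI suggestion acceptance rate into the bands described above."""
    if shown <= 0:
        raise ValueError("no suggestions were shown")
    rate = accepted / shown
    if rate < 0.15:
        return "low: tool may be poorly configured for this codebase"
    if rate > 0.50:
        return "high: possible over-reliance"
    if 0.25 <= rate <= 0.40:
        return "sweet spot"
    return "borderline"


band = acceptance_band(accepted=30, shown=100)
```

Treat the result as a prompt for a conversation, not a verdict. A team doing mostly boilerplate work could reasonably sit above 50% without any quality problem.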
AI Team-Specific Metrics
For teams building AI features (not just using AI tools), the measurement framework shifts fundamentally:
- Experiment velocity: Experiments run per sprint and the share that convert to production. Target 8-15 experiments per sprint for a 5-person team, with a 20-30% conversion rate.
- Model quality envelope: Multi-dimensional quality tracking (accuracy, latency, cost, fairness) with defined acceptable ranges rather than single-metric optimization.
- Inference cost trajectory: Cost per request over time. Should be declining quarter-over-quarter as the team optimizes prompts, caching, model selection, and batching.
- Eval suite maturity: Coverage and pass rate of automated evaluation suites. The AI equivalent of test coverage. Target: eval coverage on 100% of production prompts and models, pass rate stable or improving.
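A quality envelope can be sketched as a set of acceptable ranges that a candidate model must stay inside. The metric names, the ranges, and the `envelope_violations` helper below are illustrative assumptions, not recommended values. Each team has to define its own ranges.

```python
from typing import Optional

Envelope = dict[str, tuple[float, float]]

# Hypothetical acceptable ranges as (low, high) per metric.
ENVELOPE: Envelope = {
    "accuracy": (0.90, 1.00),
    "p95_latency_ms": (0, 800),
    "cost_per_1k_requests_usd": (0, 2.50),
    "fairness_gap": (0, 0.05),
}


def envelope_violations(metrics: dict[str, float],
                        envelope: Envelope) -> dict[str, Optional[float]]:
    """Return every metric that is missing or outside its acceptable range."""
    violations = {}
    for name, (low, high) in envelope.items():
        value = metrics.get(name)
        if value is None or not low <= value <= high:
            violations[name] = value
    return violations


# Hypothetical candidate: more accurate than required, but too slow.
candidate = {"accuracy": 0.93, "p95_latency_ms": 950,
             "cost_per_1k_requests_usd": 1.80, "fairness_gap": 0.03}
violations = envelope_violations(candidate, ENVELOPE)
```

This replaces the binary "did the change fail?" question with "which dimensions left the envelope?", which is a better fit for a release that improves one dimension while regressing another.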
Cross-Cutting Metrics for Hybrid Organizations
Most engineering organizations in 2026 run a mix of traditional software teams and AI teams. You need metrics that work across both:
- Time-to-value: Days from idea to measurable business impact. Works for both a traditional feature (idea to revenue-attributed deploy) and an AI feature (experiment to production with measured user impact). The common denominator is business impact, which removes the apples-to-oranges problem of comparing deployment counts.
- Cost per business outcome: Total team cost divided by business outcomes delivered. For product teams, outcomes might be features with measured user adoption. For AI teams, outcomes might be automation cost savings or AI feature revenue. For platform teams, outcomes might be developer hours saved through tooling improvements.
- Developer experience score: Quarterly survey covering the DX dimensions (feedback loops, cognitive load, flow state) that applies equally to all team types. A platform engineer frustrated by 45-minute build times and an AI engineer frustrated by 3-hour training runs are both experiencing broken feedback loops.
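The first two cross-cutting metrics reduce to simple calculations once the inputs are recorded. In the sketch below, the dates, the cost figure, and both helper names are hypothetical. It assumes you log an "idea" date and a "first measured impact" date for each item, which is the part most teams do not yet do.

```python
from datetime import date
from statistics import median


def time_to_value_days(items: list[tuple[date, date]]) -> float:
    """Median days from idea to first measured business impact."""
    return median([(impact - idea).days for idea, impact in items])


def cost_per_outcome(team_cost: float, outcomes: int) -> float:
    """Total team cost divided by the number of business outcomes delivered."""
    if outcomes <= 0:
        raise ValueError("no outcomes delivered in this period")
    return team_cost / outcomes


# Hypothetical quarter: (idea date, date impact was first measured).
shipped = [
    (date(2026, 1, 5), date(2026, 2, 16)),
    (date(2026, 1, 12), date(2026, 3, 9)),
    (date(2026, 2, 2), date(2026, 3, 2)),
]
ttv = time_to_value_days(shipped)
cpo = cost_per_outcome(team_cost=600_000, outcomes=4)
```

The same two functions apply unchanged to a product team, an AI team, or a platform team. Only the definition of "outcome" differs.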
Building Your Measurement Stack
Do not adopt a framework wholesale. Build a measurement stack that answers your specific questions by combining elements from multiple frameworks. Here is the process I use with CTOs:
Step 1: Identify Your Questions (Week 1)
Every measurement system should answer specific questions. Write them down. Common examples:
- Are we shipping what we committed to? (Delivery predictability)
- Is our engineering investment efficient? (Cost per outcome)
- Are our developers productive and happy? (DX + satisfaction)
- Is production stable? (Reliability)
- Are our AI investments paying off? (AI-specific business impact)
Limit yourself to 4-6 questions. Each question maps to 1-2 metrics. More than 12 total metrics means nobody will track any of them consistently.
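The limits above can be checked mechanically before a plan goes to the leadership team. The `validate_plan` helper is hypothetical, and the metric names attached to each question are examples, not a prescribed set.

```python
def validate_plan(plan: dict[str, list[str]]) -> list[str]:
    """Check a question-to-metrics plan against the limits described above."""
    problems = []
    if not 4 <= len(plan) <= 6:
        problems.append(f"{len(plan)} questions; aim for 4-6")
    for question, metrics in plan.items():
        if not 1 <= len(metrics) <= 2:
            problems.append(f"'{question}' has {len(metrics)} metrics; aim for 1-2")
    total = sum(len(metrics) for metrics in plan.values())
    if total > 12:
        problems.append(f"{total} metrics in total; keep it to 12 or fewer")
    return problems


plan = {
    "Are we shipping what we committed to?": ["delivery predictability"],
    "Is our engineering investment efficient?": ["cost per outcome"],
    "Are our developers productive and happy?": ["DX survey score", "job satisfaction"],
    "Is production stable?": ["MTTR", "change failure rate"],
    "Are our AI investments paying off?": ["AI business impact"],
}
problems = validate_plan(plan)
```

An empty list means the plan fits the limits. Anything else is a list of specific items to cut or merge.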
Step 2: Instrument What You Have (Weeks 2-4)
Most organizations already collect 60-70% of the data they need but do not surface it. Check these sources:
- CI/CD pipeline: Deployment frequency, lead time, build times, test pass rates. Tools: GitHub Actions, GitLab CI, Jenkins, CircleCI all expose these natively.
- Incident management: Incident count, MTTR, severity distribution. Tools: PagerDuty, Opsgenie, incident.io, Rootly.
- Project tracking: Cycle time, throughput, scope changes. Tools: Jira, Linear, Shortcut all have analytics views.
- Observability: Uptime, latency percentiles, error rates. Tools: Datadog, Grafana, New Relic.
Step 3: Add Survey Data (Month 2)
The dimensions that matter most (satisfaction, cognitive load, flow state) require survey data. Run a 15-question quarterly survey covering:
- Overall job satisfaction (1-10)
- Tooling satisfaction (1-10)
- Cognitive load: "How easy is it to understand the systems you work with?" (1-7)
- Flow state: "How many hours of uninterrupted coding time do you get per day?" (numeric)
- Collaboration: "How quickly do code reviews come back?" (hours)
- One open-ended: "What is the biggest source of friction in your daily work?"
Response rate target: 70%+. Below that, selection bias makes the data unreliable. Anonymize responses to get honest input. Use a dedicated survey tool (Culture Amp, Officevibe, DX platform), not a Google Form buried in Slack.
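A small aggregation step makes the 70% rule hard to ignore. The sketch below assumes responses have already been exported as anonymous per-person dictionaries of numeric answers. The question keys and the `survey_summary` helper are hypothetical and do not reflect any survey tool's export format.

```python
from statistics import mean


def survey_summary(invited: int, responses: list[dict[str, float]],
                   min_rate: float = 0.70) -> dict:
    """Compute response rate, a reliability flag, and per-question averages."""
    if invited <= 0:
        raise ValueError("nobody was invited")
    rate = len(responses) / invited
    questions = sorted({question for response in responses for question in response})
    averages = {
        question: mean([r[question] for r in responses if question in r])
        for question in questions
    }
    return {"response_rate": rate, "reliable": rate >= min_rate, "averages": averages}


# Hypothetical results: 4 of 5 invited engineers answered.
responses = [
    {"job_satisfaction": 6, "cognitive_load": 5, "focus_hours": 3},
    {"job_satisfaction": 8, "cognitive_load": 4, "focus_hours": 2},
    {"job_satisfaction": 7, "cognitive_load": 6, "focus_hours": 4},
    {"job_satisfaction": 7, "cognitive_load": 5, "focus_hours": 3},
]
result = survey_summary(invited=5, responses=responses)
```

Reporting the `reliable` flag next to every average keeps a low-response quarter from being read as a real trend.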
Step 4: Set Baselines, Then Targets (Month 3)
Do not set targets before you have baseline data. Your first quarter is measurement-only. Second quarter, set targets collaboratively with teams based on their own baseline. "Your cycle time baseline is 24 days. Can we get it to 18 by end of Q3?" is actionable. "Industry benchmark is 14 days, get there by next quarter" is demoralizing and ignores your specific context.
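One way to keep targets anchored to the baseline is to derive them from it in code. This sketch assumes weekly cycle-time samples, uses the median as the baseline, and applies a 25% reduction, which matches the 24-to-18-day example above. The `baseline_and_target` helper, the sample data, and the eight-week minimum are illustrative choices.

```python
from statistics import median


def baseline_and_target(weekly_cycle_times: list[float], reduction: float = 0.25,
                        min_weeks: int = 8) -> tuple[float, float]:
    """Return (baseline, target), refusing to set a target on too little data."""
    if len(weekly_cycle_times) < min_weeks:
        raise ValueError(f"need at least {min_weeks} weeks of baseline data")
    baseline = median(weekly_cycle_times)
    return baseline, baseline * (1 - reduction)


# Hypothetical cycle times in days, one value per week for eight weeks.
weeks = [26, 24, 23, 25, 24, 22, 24, 27]
baseline, target = baseline_and_target(weeks)
```

The reduction percentage is the part to negotiate with the team. The function only guarantees that the conversation starts from their own numbers rather than an external benchmark.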
Step 5: Review Cadence
Three rhythms, three audiences:
| Cadence | Audience | Metrics | Format |
|---|---|---|---|
| Weekly | Engineering managers + teams | Cycle time, throughput, build health, incident status | Shared dashboard, reviewed in team retros |
| Monthly | VP Engineering / CTO | Delivery predictability, DX survey trends, cost per outcome, AI metrics | 30-minute review meeting with commentary |
| Quarterly | Board / C-suite | Commitment reliability, uptime, eng cost / revenue, AI business impact | One-page slide with trends and forward-looking indicators |
Common Implementation Mistakes
Having helped engineering organizations adopt metrics frameworks over the past several years, I see these failure patterns repeatedly:
Measuring individuals instead of teams
The moment you rank individual engineers by commit count, review speed, or lines of code, you create a culture of gaming and competition that destroys collaboration. An engineer who spends a day helping three teammates unblock is more valuable than one who writes 500 lines of code in isolation, but individual metrics cannot capture that. Measure teams. Review individuals through 1-on-1s and peer feedback, not dashboards.
Adopting too many metrics at once
Engineering organizations that try to implement full SPACE (5 dimensions, 15+ metrics) in one quarter end up measuring nothing well. Start with 3-4 metrics you can instrument reliably. Add dimensions as you build the habit of reviewing and acting on data. A team that religiously tracks cycle time, developer satisfaction, and incident count will outperform a team that half-tracks 20 metrics.
Setting targets before baselines
"Our deployment frequency should be daily." Based on what? If your current deployment frequency is weekly and your architecture requires a 4-hour integration test suite, daily deploys require infrastructure investment, not just a target. Always collect 8-12 weeks of baseline data before setting targets. Targets without baselines are aspirations, not management.
Using metrics as a stick
When metrics become a performance management tool rather than a diagnostic tool, teams game them. I watched an organization where deployment frequency became a team KPI. Teams started splitting changes into tiny, trivial deployments to hit the target. Deployment frequency tripled. Value delivery did not change. Use metrics to identify friction and improvement opportunities, not to rank or punish teams.
Ignoring the qualitative
Numbers without narratives mislead. Cycle time increased from 14 to 22 days last quarter. Bad? Maybe the team took on a complex migration that was strategically necessary and inherently slower. Numbers tell you what changed. Conversations with team leads tell you why. Always pair quantitative dashboards with qualitative commentary.
Related Guides
Engineering Metrics Your Board Actually Cares About
Which metrics matter when reporting to the board. Leading vs lagging indicators, vanity metrics, and the three questions every board asks.
AI Team Metrics
How to measure AI team performance beyond traditional engineering metrics. KPIs for LLM engineering that boards and CTOs actually use.