Spec-Driven Development With AI Agents: I Ran a Real Build to Test It

What spec-driven development actually is

Spec-driven development is the answer to a question I kept hitting after my vibe coding weekend: if shipping code no human read is a liability for anything you have to maintain, what do you put in its place once agents are doing the typing? The answer that practitioners converged on through late 2025 is to move the authoritative artifact up a level. Stop treating the code as the source of truth. Treat a written specification as the source of truth, and let agents generate the code from it.

Here is the mechanism, stripped of the marketing. You author a structured spec: the objectives of a feature, the data shapes it touches, the constraints it must hold, and the edge cases it must handle. An AI agent reads that spec and generates the implementation. When you want a change, you do not edit the code. You edit the spec, and the agent regenerates the affected code to match. The code is a build artifact, the way a compiled binary is a build artifact. The thing a human owns and edits is the spec. When the spec and the code disagree, the spec wins. That inversion is the whole idea, and it is the exact opposite of vibe coding, where the running app is authoritative and the spec, if one exists at all, lives in a chat transcript.

The arxiv survey of the field puts it cleanly: spec-driven development "treats the spec, not the code, as the primary artifact," with code as the "last-mile" output generated from and validated against a specification that serves as the single source of truth for both human developers and AI agents (Spec-Driven Development: From Code to Contract in the Age of AI Coding Assistants). That phrase, single source of truth for both humans and agents, is the part that matters. The spec is the contract that two very different kinds of reader, one of them statistical, both work from.

Why this is showing up now

Spec-driven development is not new as an idea. We have written specs before code for decades, and most of us abandoned them the moment the code started moving, because keeping a spec in sync with a living codebase by hand was a losing battle nobody had time for. What changed is that the cost of regenerating code from a spec collapsed. When an agent can rebuild a component from an updated spec in seconds, the spec stops being documentation that rots and becomes the executable origin of the code. The economics flipped, and an old discipline became practical for the first time.

Anthropic's 2026 Agentic Coding Trends report names the pressure that makes this matter. It found that developers now use AI for roughly 60% of their work but report being able to fully delegate only 0 to 20% of tasks (2026 Agentic Coding Trends Report; summarized by Pathmode). That delegation gap is the whole story of agentic coding right now. People reach for agents constantly and trust them with almost nothing end to end. The report's argument is that structured intent specifications are the safeguard that closes the gap, the thing that "replaces code-level understanding with outcome-level confidence." A loose prompt leaves the agent to guess at your constraints. A spec tells it. When the person orchestrating the agent cannot read the output well enough to catch a subtle error, the spec is what catches it instead.

The maturity levels

Not all spec-driven development is the same depth, and conflating the levels is where most of the confusion comes from. Practitioners describe three, and they form a ladder of how much authority you hand to the spec.

Level one, spec-first. You write the specification before any code, and the spec constrains what the agent generates. But the code is still the primary deliverable you maintain after generation. The spec gets you a clean start and a shared understanding, then you live in the code like always. This is the lightest version, and it is most of what people mean when they first try the technique.

Level two, spec-anchored. The spec stays authoritative across the whole lifecycle, not just at the start. You add governance on top: review checkpoints, constitutional constraints the agent must respect, and an audit trail of why the system does what it does. This is the level that suits anything regulated, long-lived, or touched by more than one person. The spec is the durable record, and the code is regenerated against it as it changes.

Level three, spec-as-source. The furthest extent. The spec is the only artifact a human edits, and code is regenerated wholesale from it. You do not open the implementation, ever, the way a C programmer does not open assembly. This is where the field is pointed, and it is also where the real risk lives, because regenerating an entire system from a spec is a big-bang release with all the failure modes big-bang releases have always had.

The honest state of the art: most teams operate at level one or two and aspire toward level three (three maturity levels; Augment Code's guide). I ran my test at level two, spec-anchored, because it is the level you can actually ship a real feature on today without pretending the regeneration risk away. So that is the level the rest of this is about.

What I built, and what it cost

The brief I set myself was deliberately harder than the vibe coding test, because spec-driven development earns its keep on maintenance, not on the first sprint. I picked a feature I would then have to change twice: a small invoicing service for my consulting work, with line items, tax handling for two jurisdictions, and a PDF export. The point was not just to build it once. It was to build it from a spec, then change a requirement, and see whether editing the spec and regenerating was actually better than editing code by hand.

The stack, chosen so the spec genuinely drove the code rather than decorating it:

Claude Code as the agent, run against a single living spec file I kept in the repo root and treated as authoritative. $20 Pro plan plus metered overflow once I exceeded the included usage on day two. $79.40 of actual agent spend across the four days.
A handwritten spec in Markdown: objectives, a data model section, explicit tax rules per jurisdiction, an edge-case list (zero-quantity lines, rounding, a credit note that goes negative), and acceptance criteria the agent had to satisfy. This file was the work. Everything else was generated.
Railway for deploy, a Node service with a Postgres database. $18.30 for the four-day window including the database add-on.
A domain I already owned, so $0 there. Total out of pocket: $107.62.

Three timeboxes across four days in late May 2026. Tuesday 14:00 to 18:40 to write the first spec and generate. Wednesday 10:00 to 15:20 for the first requirement change. Friday 09:30 to 13:00 for the second change and the deploy. Just under nineteen hours of wall-clock time. The rule I held myself to was the inverse of the vibe coding rule: I do not edit generated code directly. If something is wrong, I find the gap in the spec, fix the spec, and regenerate. The moment I hand-patch a generated file, I have broken spec-anchored discipline and fallen back to ordinary AI-assisted coding.

What worked

The first generation was slower to arrive than a vibe loop and dramatically more correct when it did. Writing the spec took me the entire first afternoon, far longer than describing an app conversationally. But when Claude Code generated the service from it, the tax logic was right on the first pass, including the jurisdiction split, because I had written the two rule sets out explicitly instead of hoping the model would infer them. The edge cases I had bothered to list, the negative credit note in particular, came back handled. The ones I had not listed did not, which is exactly the lesson and I will get to it.

The real payoff landed on Wednesday, on the first requirement change. The client rule I changed was the rounding behaviour: round per line item rather than on the invoice total, a change that touches calculation, storage, and the PDF. In a hand-maintained codebase that is a careful three-file edit with a real chance of missing one site. I edited two sentences in the spec's tax section and one acceptance criterion, told the agent the spec had changed, and it regenerated all three touch points consistently. That consistency is the thing spec-driven development sells, and on this change it delivered exactly as advertised. The spec was a single place to express an intent that fanned out across the system, and the fan-out was the agent's problem, not mine.

The second-order benefit I did not expect: the spec made the diff reviewable. Because I was reviewing generated code against a written intent I had authored that morning, I could actually tell whether the output was correct, which is the precise thing that breaks down in vibe coding. I was not reading code against vibes. I was reading code against a contract.

What broke

The failures were almost all spec failures, and that is the honest and slightly uncomfortable finding. On Friday's second change I asked for partial payments against an invoice. My spec said an invoice has a paid status. It did not say what happens to that status when a payment covers part of the balance, and it did not define where a partial payment is recorded. The agent resolved the ambiguity the way ambiguity always gets resolved: against me. It generated a boolean paid flag that flipped to true on any payment, so a $10 payment on a $1000 invoice marked it settled. The code was internally consistent and did exactly what my underspecified spec implied. The bug was in the contract, not the implementation.

I lost about an hour to this, and the loop was instructive precisely because it was not the vibe coding loop. In vibe coding I would have described the wrong behaviour and let the agent guess at a fix, possibly moving the bug. Here the fix was to go back to the spec, add a partial-payment section with an explicit amount-paid field and a status enum of unpaid, partial, and paid, and regenerate. The bug could not move, because the spec now pinned the behaviour down. But, and this is the part nobody selling spec-driven development emphasizes, I only found the gap by running the generated app and seeing the wrong status. The spec did not catch its own omission. The running system did. The spec is authoritative, but it is not self-validating, and an incomplete spec produces confidently wrong code just as readily as a vague prompt does.

The other thing that broke was less dramatic and more structural. Regeneration is not always surgical. On Wednesday's rounding change the agent regenerated cleanly, but on Friday a larger spec edit caused it to also rewrite a helper I had not intended to touch, in a slightly different style, which would matter on a real team where someone had built tooling around the old shape. This is the spec-as-source risk arriving early: the bigger the spec change, the closer regeneration gets to a big-bang release, and the less you can predict which untouched parts come back altered. At level two I could review and catch it. At level three, generating wholesale, I am not sure I would have.

The honest verdict against vibe coding

These two techniques are not competitors so much as tools for opposite phases, and treating them as rivals is the mistake. Vibe coding is the fastest path I have ever used from idea to a running answer, and it is genuinely bad at evolving a system, because the intent lives only in a chat history that the model forgets and you never wrote down. Spec-driven development is slower to first output, sometimes painfully so, and it is genuinely good at evolving a system, because the intent is a versioned document that survives the session, the model, and the engineer.

Put the numbers side by side. My vibe coding weekend got me ninety percent of an app in three hours and then cost me ninety minutes failing to fix a bug in code nobody had read. My spec-driven build cost me a full afternoon before the first line of code existed, and then absorbed two requirement changes in minutes each, with diffs I could actually review. The vibe loop optimizes the green-field sprint and mortgages everything after it. Spec-driven development pays a tax up front and earns it back at exactly the moment that costs real money, which is the part where you change a system you have to keep.

So the verdict is not that one wins. It is that they have a boundary, and the boundary is the same one I found in the vibe coding test, drawn from the other side. Vibe coding is the right tool when the blast radius of a failure is a redeploy and the lifespan is short. Spec-driven development is the right tool when the system has to survive, when more than one person touches it, and when someone will need to know in a year why it behaves the way it does. The spec is the answer to that last question, and a chat transcript is not.

My rule after both experiments is one sentence. Use vibe coding to decide whether to build, and spec-driven development to build the thing you have decided to keep. If you want the longer argument for the orchestration layer above both, I worked through it in the agentic orchestration framework comparison, where the same question, how do you get reliable work out of agents you cannot fully trust, shows up at the level of multi-agent systems instead of single files.

What it means for your team

The delegation gap that Anthropic measured, near-universal AI use and almost no full delegation, is not a problem you solve by buying a better model. It is a problem you solve by writing better specifications, because the spec is the interface between an intent you hold and an agent you cannot fully audit. That reframes what the scarce skill is. It is not producing code, which agents do now, and it is not even reviewing code, which matters but does not scale. It is the ability to write a specification precise enough that a statistical system builds the right thing from it the first time, and complete enough that the gaps do not resolve against you.

That is a real skill, and most engineers do not have it yet, because for twenty years the spec was the part you skipped to get to the code. The teams that will get the most out of agentic coding are the ones that treat spec-writing as a first-class engineering discipline rather than documentation overhead. My four days proved the smaller version of that claim: every bug I hit was a spec bug, which means every bug was something a sharper specification would have prevented. The work did not disappear when the agent started typing. It moved up a level, to the document, where it has arguably always belonged.

FAQ

What is spec-driven development?

Spec-driven development is a way of building software where a structured specification, not the code, is the primary artifact humans author and maintain. You write and edit a living spec that states the objectives, constraints, data shapes, and edge cases of a feature; AI agents then generate the implementation from that spec and regenerate it when the spec changes. The code becomes a build output validated against the spec rather than the thing engineers edit by hand. The defining move is that the spec is authoritative: when the spec and the code disagree, the spec wins.

How is spec-driven development different from vibe coding?

They put authority in opposite places. In vibe coding the running app is the source of truth: you describe a change, the agent writes code you do not read, and you respond to the behaviour you see. In spec-driven development the written specification is the source of truth: you edit the spec, the agent regenerates code to match it, and you review against the spec rather than against vibes. Vibe coding is faster to a first working version and worse at evolution, because the intent lives only in a chat history. Spec-driven development is slower up front and far better at maintenance, because the intent is a durable, versioned document.

Do AI agents work better with a spec?

Yes, measurably, and Anthropic's own 2026 data explains why. Their Agentic Coding Trends report found developers now use AI for roughly 60% of their work but can fully delegate only 0-20% of tasks. That delegation gap closes when the agent is handed a structured intent spec instead of a loose prompt, because the spec supplies the objectives, constraints, and edge cases that a one-line prompt leaves the agent to guess at. In my own build, the difference between a feature that came back correct on the first generation and one that needed four rounds was almost entirely the precision of the spec I wrote, not the model.

What are the maturity levels of spec-driven development?

Practitioners describe three. Spec-first means you write the specification before any code, and the spec constrains what the agent generates, but code stays the primary deliverable you maintain. Spec-anchored keeps the spec authoritative across the whole lifecycle and adds governance: review checkpoints, constitutional constraints, and audit trails, which suits regulated or long-lived systems. Spec-as-source is the furthest extent, where the spec is the only artifact a human edits and code is regenerated wholesale from it. Most teams today operate at spec-first or spec-anchored and aspire toward spec-as-source, which still carries real risk around big-bang regeneration.

Is spec-driven development worth it?

It is worth it for systems you have to keep, and overkill for things you will throw away. The cost is real: writing a spec precise enough for an agent to build from is slower than just describing a change, and on a weekend prototype that overhead is pure tax. The payoff shows up at maintenance time, when a versioned spec lets you regenerate a component, onboard a teammate, or audit why the system behaves as it does, none of which a chat transcript gives you. My rule after the build: use vibe coding to decide whether to build, and spec-driven development to build the thing you are committing to maintain.