The Experiment
I built the same agent in six orchestration frameworks. Each build was scored on the things a production team cares about: developer experience, debugging quality, cost per run, and how reliably it runs 20 times in a row without human intervention.
The test agent is deliberately simple. A 3-step research pipeline: search the web, extract structured claims from the top results, format a markdown report with citations. Small enough to build in a day. Complicated enough to expose how each framework handles state, tool calling, and error propagation.
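For concreteness, the structured claims and the report target roughly this shape. A minimal sketch; the field names here are illustrative, not the exact schema I used:

```python
from pydantic import BaseModel

class Claim(BaseModel):
    """One structured claim extracted from a search result."""
    statement: str
    source_url: str
    source_title: str

class Report(BaseModel):
    """The final artifact: a set of claims rendered to markdown."""
    query: str
    claims: list[Claim]

    def to_markdown(self) -> str:
        lines = [f"# Research: {self.query}", ""]
        for i, c in enumerate(self.claims, 1):
            lines.append(f"{i}. {c.statement} ([{c.source_title}]({c.source_url}))")
        return "\n".join(lines)
```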
The Six Frameworks
LangGraph
LangGraph models your agent as a directed graph (cyclic if you need loops). State is explicit and typed. Nodes are Python callables. Edges are conditional routing functions. You get the most precise control over execution flow of any framework in this comparison. The cost: the most boilerplate. Built by the LangChain team, it shares the LangChain ecosystem.
Best for: Teams already using LangChain. Complex multi-agent workflows where execution order matters. Production use cases where you need full observability into every state transition.
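Here's the rough shape of the 3-step pipeline in LangGraph. A sketch against the pre-1.0 StateGraph API; method names (set_entry_point vs. START edges) have shifted between versions, and the node bodies are stubs:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ResearchState(TypedDict):
    query: str
    raw_results: list[str]
    claims: list[dict]
    report: str

def search(state: ResearchState) -> dict:
    # Call your search tool here; return only the keys you update.
    return {"raw_results": ["..."]}

def extract(state: ResearchState) -> dict:
    return {"claims": [{"statement": "...", "source": "..."}]}

def format_report(state: ResearchState) -> dict:
    return {"report": "# Report\n..."}

graph = StateGraph(ResearchState)
graph.add_node("search", search)
graph.add_node("extract", extract)
graph.add_node("format", format_report)
graph.set_entry_point("search")
graph.add_edge("search", "extract")
graph.add_edge("extract", "format")
graph.add_edge("format", END)

app = graph.compile()
final = app.invoke({"query": "agent orchestration frameworks"})
```

Every transition is spelled out, which is exactly where the boilerplate and the debuggability both come from.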
CrewAI
CrewAI sits one level up from the graph. You define a Crew (the team), Agent objects (role, goal, backstory), and Task objects. The framework handles delegation. The on-ramp is the fastest here. A working multi-agent system in 30 lines of Python. The abstraction charges you interest when you need to debug a stuck handoff or figure out why an agent picked the wrong branch.
Best for: Teams new to multi-agent systems. Rapid prototyping. Use cases where task decomposition is stable and well understood.
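The fast on-ramp looks roughly like this. A sketch of the Crew/Agent/Task surface with tool wiring omitted:

```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Find and extract claims about the query",
    backstory="A meticulous web researcher.",
)
writer = Agent(
    role="Writer",
    goal="Turn claims into a cited markdown report",
    backstory="A technical writer who never drops a citation.",
)

research = Task(
    description="Search the web and extract structured claims.",
    expected_output="A list of claims with source URLs.",
    agent=researcher,
)
report = Task(
    description="Format the claims as a markdown report with citations.",
    expected_output="A markdown document.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research, report])
result = crew.kickoff()
```

Note what's missing: you never say how the handoff between researcher and writer actually happens. That's the interest payment described above.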
AutoGen (Microsoft)
AutoGen frames everything as conversations between agents. The ConversableAgent primitive handles both LLM agents and function-calling tools. Microsoft's enterprise DNA is visible: Azure OpenAI integration is first-class, docs are solid, support is active. The conversational framing is intuitive for some workflows and clumsy for others.
Best for: Azure-heavy shops. Workflows where back-and-forth agent dialogue maps cleanly onto the task. Teams that want Microsoft standing behind the framework.
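A sketch of the conversational framing, assuming the classic pyautogen-style ConversableAgent API; the config values are placeholders:

```python
from autogen import ConversableAgent

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "..."}]}

researcher = ConversableAgent(
    name="researcher",
    system_message="Search, extract claims, and report with citations.",
    llm_config=llm_config,
)
user = ConversableAgent(
    name="user",
    human_input_mode="NEVER",  # fully automated run
    llm_config=False,          # this side never calls an LLM
)

# The pipeline becomes a bounded back-and-forth dialogue.
user.initiate_chat(
    researcher,
    message="Research: agent orchestration frameworks",
    max_turns=2,
)
```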
OpenAI Swarm
Swarm is the minimal option. Two concepts: Agents and handoffs. An agent calls a handoff function that transfers control to another agent. That's it. Swarm is intentionally small. It's an educational reference implementation from OpenAI, explicitly flagged as not for production. Running it in production means building state persistence, retry logic, and observability yourself.
Best for: Learning the agent handoff mental model. Prototypes where you control the full stack. Teams that want to understand what the bigger frameworks build on top of before taking on the abstraction.
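The whole mental model fits in a few lines. A sketch against the reference implementation; the instruction strings are illustrative:

```python
from swarm import Swarm, Agent

def transfer_to_writer():
    """Handoff: returning an Agent transfers control to it."""
    return writer

writer = Agent(
    name="Writer",
    instructions="Format the extracted claims as a markdown report with citations.",
)
researcher = Agent(
    name="Researcher",
    instructions="Search, extract claims, then hand off to the writer.",
    functions=[transfer_to_writer],
)

client = Swarm()
response = client.run(
    agent=researcher,
    messages=[{"role": "user", "content": "Research: agent orchestration frameworks"}],
)
print(response.messages[-1]["content"])
```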
Pydantic AI
Pydantic AI brings type safety to the agent layer. Tool calls are typed functions. Agent inputs and outputs are Pydantic models. Hallucinated tool arguments fail at validation with a clear error instead of silently corrupting downstream state. The framework has the lowest impedance to standard Python engineering practice. Existing type checkers, test fixtures, and IDE support all work without configuration.
Best for: Python teams that care about type safety and testability. Use cases where structured output correctness is critical. Engineers coming from FastAPI or modern Python backends.
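A sketch of the typed surface. Names like result_type, run_sync, and .data have shifted across pre-1.0 releases, and web_search here is a hypothetical stand-in:

```python
from pydantic import BaseModel
from pydantic_ai import Agent

class Claim(BaseModel):
    statement: str
    source_url: str

class Report(BaseModel):
    claims: list[Claim]
    markdown: str

agent = Agent(
    "openai:gpt-4o",
    result_type=Report,  # the LLM's output is validated against this model
    system_prompt="Extract claims from search results and write a cited report.",
)

@agent.tool_plain
def web_search(query: str) -> list[str]:
    """Hypothetical search tool; plug in your real search client here."""
    return ["result snippet one", "result snippet two"]

result = agent.run_sync("Research: agent orchestration frameworks")
report: Report = result.data  # typed; malformed output raises a validation error
```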
LlamaIndex Agents
LlamaIndex is best known as a RAG framework, and its agent primitives reflect that heritage. If your agent's main job is querying your own documents, LlamaIndex integrates most naturally with its retrieval infrastructure. For general-purpose orchestration without a strong RAG component, the framework adds overhead compared to its peers.
Best for: Document-centric agents. RAG pipelines that need orchestration. Teams already running LlamaIndex for retrieval.
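A sketch of the document-centric framing, assuming llama-index core's ReActAgent and a local folder of documents:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# Index your own documents; retrieval is the first-class citizen here.
docs = SimpleDirectoryReader("./research_notes").load_data()
index = VectorStoreIndex.from_documents(docs)

doc_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="notes",
    description="Search the local research notes.",
)

agent = ReActAgent.from_tools([doc_tool], verbose=True)
response = agent.chat("Extract claims about agent frameworks and cite sources.")
```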
What We Score
Developer experience: time to working implementation, lines of framework-specific boilerplate, and how fast you can answer "why did this run fail?" from logs alone.
Debuggability: quality of traces in the native debug tooling, how much information the framework exposes about state at each step, and whether errors point to the actual problem.
Cost: token consumption for the same 3-step task. Frameworks that require extra LLM calls for routing or planning inflate cost without improving output.
Reliability: success rate over 20 repeated runs, and failure mode quality. Does it fail noisily or silently? Does retry logic need to be built from scratch?
Dependency health: transitive dependency count, pin conflicts with common Python ML stacks, and the cadence of breaking changes in the last 6 months.
Operations and ecosystem: availability of state persistence solutions, built-in support for async execution, and community size plus enterprise adoption signals.
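To make the cost and reliability criteria concrete, the measurement harness is roughly this shape. A sketch: run_agent is a stand-in for each framework's entry point, and the token count comes from whatever usage report that framework exposes:

```python
import time

def run_benchmark(run_agent, query: str, n_runs: int = 20) -> dict:
    """run_agent(query) -> (report_markdown, tokens_used); framework-specific."""
    successes, tokens, durations = 0, [], []
    for _ in range(n_runs):
        start = time.monotonic()
        try:
            report, used = run_agent(query)
            if report.strip():  # empty output counts as a failure
                successes += 1
                tokens.append(used)
        except Exception:
            pass  # noisy vs. silent failures get logged separately
        durations.append(time.monotonic() - start)
    return {
        "success_rate": successes / n_runs,
        "avg_tokens": sum(tokens) / len(tokens) if tokens else 0,
        "avg_seconds": sum(durations) / len(durations),
    }
```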
What I Found So Far
Without front-running the full scorecard: the frameworks split into two camps faster than I expected. One camp has strong opinions about state (LangGraph, Pydantic AI). The other leaves state management to you (Swarm, CrewAI). That one design choice drives almost everything about your debugging experience when something breaks in production.
The framework-by-framework breakdown, the cost comparison table, and the surprise from the 20-run reliability test ship with the Season 2 podcast episodes.
What I Got Wrong Predicting This
I went into this experiment with two strong priors. Both turned out to be partly wrong.
First: I assumed CrewAI would be the easiest on-ramp for anyone new to multi-agent systems. The 30-lines-of-Python demo is real. You can get a working crew running very quickly. What I didn't predict was how leaky the abstraction gets the first time you need to understand why an agent made a particular decision. The Crew/Agent/Task model is intuitive when the happy path works. Once you're trying to trace a bad handoff, you're reading framework internals instead of your own code. The fast start trades off against the debugging ceiling.
Second: I assumed LangGraph's boilerplate would be a deal-breaker for small teams. It isn't. The typed state graph feels heavy on day one and pays you back by week two. When something breaks, the trace you get from LangSmith plus the explicit state definition usually surfaces the bug in the first 10 minutes. The boilerplate sucks on day one and saves you from overnight debugging sessions starting day ten.
One more thing worth flagging for anyone evaluating these frameworks today: most of them are still pre-1.0. Breaking API changes between minor versions are common. LangGraph shipped three breaking changes in the six months before I ran this experiment. CrewAI shipped two. If your team is betting a product on one of these, pin your versions and read the changelog before every upgrade. Treat the framework as a volatile dependency, not a stable platform.
The lesson I keep coming back to: framework comparisons written after the tutorial are different from framework comparisons written after two weeks of real use. These are rough notes from week two.
FAQ
Which agent orchestration framework should I pick in 2026?
Depends on your team and your use case. If you already use LangChain, LangGraph is the obvious extension. Tight integration, mature docs. If you want something lighter with less opinion built in, Pydantic AI is worth evaluating. AutoGen from Microsoft has strong enterprise backing. CrewAI is the fastest on-ramp for teams new to multi-agent systems. OpenAI Swarm is the simplest conceptually and the least production-ready. LlamaIndex is strongest when the agent's main job is retrieval over documents. Full scorecard publishes when Season 2 audio drops.
What is the difference between LangGraph and CrewAI?
LangGraph models your agent system as a directed graph. Nodes are functions or agents. Edges are transitions. State is explicit and typed. You get fine-grained control over execution flow in exchange for more boilerplate. CrewAI uses a higher-level abstraction. You define a Crew, Agents, and Tasks and the framework handles orchestration. CrewAI is faster to start. LangGraph gives you more control when things go wrong. Both ran the same 3-step test agent in this experiment. Full comparison ships with Season 2.
Is LangGraph production-ready?
Yes, with caveats. LangGraph is the most mature option here, with the strongest community and documentation. The main operational concerns: state persistence (LangGraph Cloud handles this; self-hosted deployments need their own store) and the onboarding friction of the graph definition for engineers who don't already know LangChain. For teams already inside the LangChain ecosystem, it's the safest production choice today.
What is Pydantic AI and why does it matter?
Pydantic AI is an agent framework built on top of Pydantic, Python's most widely used data validation library. The core premise: agent inputs and outputs are just typed Python objects. Your IDE gives you autocomplete. Your tests get real type coverage. Hallucinated tool calls fail fast with a validation error instead of silently corrupting state. For teams that care about type safety and testability, Pydantic AI has the lowest impedance to standard Python engineering practice.
How does the 3-step research agent test work?
The test agent does three things: (1) web search on a given query, (2) summarize the top 3 results into structured claims, (3) format the claims as a markdown report with sources. Same prompt, same tools, same target output across all six frameworks. I measure: time to working implementation, lines of framework-specific code, trace quality in the observability layer, cost per run, and reliability over 20 repeated runs. Results publish with the Season 2 podcast episodes.