AI Team Design
AI Team Topology: Where LLM Engineers Actually Sit
The Org Design That Ships AI
Team Topologies (Skelton & Pais, 2019) gave us four fundamental team types: stream-aligned, enabling, complicated-subsystem, and platform. The framework reshaped how engineering organizations think about cognitive load and team interaction. But which of the four is the AI team? None of them. Or all of them, depending on your maturity stage.
Most organizations get this wrong. They centralize AI into a bottleneck, scatter it across product teams without shared infrastructure, or build a platform nobody uses. Conway's Law punishes every one of these mistakes by encoding bad org design directly into the AI system architecture.
THE PROBLEM
Team Topologies wasn't designed for AI
The original Team Topologies framework assumes teams build and own discrete services. The interaction modes (collaboration, X-as-a-Service, facilitating) assume bounded, predictable interfaces between teams. AI breaks these assumptions.
First, AI teams produce probabilistic systems. A platform team shipping a CI/CD pipeline can define a stable contract: push code, get a build artifact. An ML platform team shipping a model serving layer cannot guarantee deterministic behavior. The "contract" between the platform and its consumers includes model performance, latency budgets, drift detection, and evaluation criteria that change with every retraining cycle. This is fundamentally different from traditional platform engineering.
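To make the idea of a changing "contract" concrete, here is a minimal sketch of what such an agreement could look like as data. The class name `ServingContract`, its fields, and the metric keys are all hypothetical, chosen only to mirror the four terms named above (model performance, latency budget, drift detection, evaluation criteria). It is an illustration, not a real platform API.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class ServingContract:
    """Hypothetical contract between an ML platform and one consuming team.

    A new instance is expected per retraining cycle, because the agreed
    thresholds can change whenever the model does.
    """
    model_version: str
    min_offline_score: float   # evaluation criterion agreed for this version
    p95_latency_ms: float      # latency budget
    max_drift_score: float     # drift level above which the model needs review


def contract_violations(contract: ServingContract,
                        observed: Mapping[str, float]) -> List[str]:
    """Return the names of the contract terms that the observed metrics violate."""
    violations = []
    if observed["offline_score"] < contract.min_offline_score:
        violations.append("offline_score")
    if observed["p95_latency_ms"] > contract.p95_latency_ms:
        violations.append("p95_latency_ms")
    if observed["drift_score"] > contract.max_drift_score:
        violations.append("drift_score")
    return violations


# Example with made-up numbers: only the latency budget is exceeded.
contract = ServingContract("recsys-v7", min_offline_score=0.80,
                           p95_latency_ms=150.0, max_drift_score=0.2)
observed = {"offline_score": 0.85, "p95_latency_ms": 180.0, "drift_score": 0.1}
print(contract_violations(contract, observed))
```

Unlike a CI/CD contract, nothing here promises a specific output for a specific input. The contract bounds aggregate behavior, and the bounds themselves are versioned alongside the model.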
Second, AI work spans the entire stack. A single recommendation feature touches data engineering, model training, model serving, API design, frontend integration, and A/B testing infrastructure. No single Team Topologies type cleanly owns all of that. Stream-aligned teams lack ML infrastructure knowledge. Platform teams lack product context. Enabling teams lack the sustained ownership to see a model through to production.
Third, the talent market forces premature decisions. ML engineers and AI engineers are expensive and scarce. Organizations hire one or two, put them somewhere in the org chart, and that initial placement calcifies into a topology that persists long after it stops making sense. The first hire's manager becomes the "AI team lead" by default, not by design.
The result is topology drift: the AI team's formal position in the org chart diverges from the actual work being done. Shadow processes appear. Infrastructure gets duplicated. The classic "we need to talk to the AI team" bottleneck slows down every product team simultaneously.
THE FRAMEWORK
Four models for AI team placement
Every organization deploying AI lands on one of these four topologies, whether they chose it deliberately or drifted into it. Each has a natural habitat and a failure mode that triggers the reorg.
Centralized AI Team
The CoE Model
One team owns all AI work for the entire company. Product teams submit requests, the AI team prioritizes and delivers. The team typically reports to a VP of Engineering or CTO, sits adjacent to data engineering, and operates its own backlog independent of product sprints.
When it works: Early maturity (0-2 models in production). The company is still figuring out what AI can do for them. Demand is low enough that one team can serve it without creating a months-long queue. The initial models are experimental, and centralizing expertise accelerates learning. Startups under 30 engineers and enterprises in the first year of an AI initiative fit here.
When it breaks: The moment demand outpaces bandwidth. Product teams start waiting weeks for the AI team's attention. Priority conflicts become political. The centralized team becomes a gatekeeper rather than an enabler. Two failure modes: (a) the team becomes so specialized in one domain that it can't context-switch to another, or (b) it spreads so thin across domains that quality drops everywhere.
Team Topologies mapping: The centralized AI team functions as an enabling team at best, a complicated-subsystem team at worst. Enabling when it's actively transferring knowledge to product teams and building toward its own dissolution. Complicated-subsystem when it's hoarding expertise and creating permanent dependencies.
Real example: A mid-market SaaS company (200 engineers, $50M ARR) stands up a 4-person ML team to ship their first recommendation engine. The team succeeds because they have one customer (the product team building recommendations) and one model to ship. Eighteen months later, five product teams want AI features, the ML team is a 6-month bottleneck, and the reorg begins.
Embedded AI Engineers
The Feature Team Model
AI engineers sit directly inside product teams. They report to the product team's engineering manager, attend the same stand-ups, and ship AI features alongside frontend and backend engineers. There is no centralized AI team; each product team owns its own AI capabilities end-to-end.
When it works: When AI is a feature, not the product. When the AI work is primarily integration (calling LLM APIs, building RAG pipelines, tuning prompts) rather than model development. When each product team's AI needs are distinct enough that shared infrastructure would be premature abstraction. AI-augmented products (vs. AI-native products) fit this model well.
When it breaks: Infrastructure duplication. Team A builds a vector database integration, Team B builds a different one. Team C builds a prompt management system, Team D rolls their own. Within a year you have four RAG architectures, three evaluation frameworks, and zero shared model registry. Inconsistency in AI behavior across the product surface becomes a UX problem. Embedded engineers get isolated from ML peers and stop developing their craft.
Team Topologies mapping: AI engineers are simply part of stream-aligned teams. This is the purest Team Topologies implementation, but it only works when the AI work is bounded enough to fit inside a single team's cognitive load budget without requiring deep ML platform knowledge.
Real example: A product-led growth company (80 engineers) where three product teams each have one AI engineer building LLM-powered features (smart search, content generation, customer support chatbot). Works fine for 18 months. Then the CEO asks "why do our three AI features behave so differently?" and "why are we paying three separate vector DB bills?" and the reorg conversation starts.
Platform AI Team
The ML Platform Model
The AI team builds and operates shared ML infrastructure: model registries, feature stores, experiment tracking, model serving, evaluation pipelines, prompt management, vector databases, and guardrails. Product teams consume this platform to build their own AI features without needing deep ML ops expertise. The platform team ships tooling, not models.
When it works: At scale (10+ models in production, 5+ teams building AI features). When the infrastructure duplication from the embedded model has become too expensive. When you have enough internal demand to justify a dedicated platform investment. When the platform team has clear users and can define X-as-a-Service contracts with measurable SLOs.
When it breaks: When the platform gets too far from product reality. The classic failure: the platform team builds what they think product teams need (usually more abstraction, more configurability) while product teams need something simpler (a working example, good defaults, fast iteration cycles). The platform becomes overengineered for the sophistication of its users. Second failure mode: the platform team has no embedded context, so they can't help when things go wrong in production. Product teams still need "someone who knows ML" on the team, which pushes you toward the hybrid model described next.
Team Topologies mapping: A clean platform team in the Team Topologies sense. Operates X-as-a-Service with stream-aligned teams as consumers. The interaction mode is clearly defined: product teams call the platform's APIs, the platform team maintains the infrastructure. This is the most natural Team Topologies fit of the four models.
Real example: Spotify's ML platform team and Uber's Michelangelo, or any company with enough models in production that the build/serve/monitor loop needs to be a shared, maintained service rather than copy-pasted boilerplate in each team's repo. At this scale, the platform team is typically 8-15 engineers, and product teams have 0-2 AI engineers each who focus on application logic, not infrastructure.
Hybrid Hub-and-Spoke
The Consensus Model
A central AI platform team provides shared infrastructure and best practices. Embedded AI specialists sit inside product teams and consume the platform while maintaining product context. An enabling function (sometimes called "AI guild" or "ML community of practice") connects the embedded specialists so they share learnings and prevent drift. Three layers: platform (the hub), embedded specialists (the spokes), and the connecting tissue (the guild).
When it works: For organizations with 50+ engineers and 3+ teams building AI features. This is the consensus model for most companies past the experimentation phase. It solves the three failure modes simultaneously: no bottleneck (product teams have their own AI people), no duplication (shared platform handles infrastructure), no isolation (guild connects embedded specialists). It maps directly to Team Topologies: a platform team, stream-aligned teams with embedded AI engineers, and a thin enabling team that facilitates knowledge sharing.
When it breaks: Coordination overhead. Three layers means three places where misalignment can hide. The platform team builds something the spokes don't use. The spokes go rogue and build custom solutions because the platform is "too slow." The guild becomes a talking shop without decision-making authority. The tax on this model is continuous alignment work: the Head of AI (or equivalent) can spend around 40% of their time on internal coordination rather than technical work.
Team Topologies mapping: This IS Team Topologies applied correctly. Platform team (AI infrastructure) + stream-aligned teams (with embedded AI specialists) + enabling team (the guild/community of practice). The interaction modes are: X-as-a-Service between platform and stream-aligned teams, facilitating between the enabling team and everyone else. Collaboration mode activates temporarily when a new capability is being built (platform collaborates with the first product team to use it, then shifts to X-as-a-Service for subsequent teams).
Real example: A Series D fintech (300 engineers, 40 in AI/ML across 8 product teams). Central ML platform team of 12 handles model serving, feature store, experiment tracking, and evaluation infrastructure. Each product team has 2-4 AI engineers who build models and RAG pipelines on the platform. Bi-weekly ML guild meetings share learnings. A Head of AI (reporting to CTO) owns the platform team directly and has dotted-line influence over embedded AI hires.
DECISION FRAMEWORK
Maturity-based topology selection
Pick based on where you are today, not where you want to be in three years. Premature platformification is as dangerous as staying centralized too long.
| Maturity Stage | Models in Prod | Recommended Topology | Why |
|---|---|---|---|
| Experimentation | 0-2 | Centralized | Concentrate scarce expertise. Learn what works before distributing. The team's job is to prove AI value, not scale it. |
| Early Production | 3-5 | Embedded or early Hub-and-Spoke | Demand now exceeds centralized capacity. Either embed engineers into the teams with highest AI leverage, or begin splitting into platform + embedded. |
| Scaling | 6-9 | Hub-and-Spoke | Infrastructure duplication is now expensive. Platform investment has clear ROI. Enough teams are using AI that patterns have emerged. |
| At Scale | 10+ | Platform + Embedded Specialists | Full platform with well-defined contracts. Embedded specialists are experienced enough to operate independently. Guild keeps alignment. |
| AI-Native | AI is the product | Stream-aligned AI teams from day one | When AI is the core product (not a feature), AI engineers ARE the product engineers. No need for a separate "AI team" topology. Build a platform when infrastructure needs warrant it, same as any other engineering platform decision. |
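The table can be read as a simple lookup. The sketch below encodes it as a function. The name `recommend_topology` and its parameters are hypothetical, and treating exactly 10 models as "At Scale" is a choice made for this sketch. Use it as a starting point for discussion, not as a substitute for judgment about your organization.

```python
def recommend_topology(models_in_prod: int, ai_is_product: bool = False) -> str:
    """Map a maturity stage to the recommended topology from the table.

    models_in_prod: number of models currently serving production traffic.
    ai_is_product:  True for AI-native companies, where AI is the core product.
    """
    if models_in_prod < 0:
        raise ValueError("models_in_prod must be non-negative")
    if ai_is_product:
        # AI-Native: the model count does not drive the decision.
        return "Stream-aligned AI teams from day one"
    if models_in_prod <= 2:
        return "Centralized"                        # Experimentation
    if models_in_prod <= 5:
        return "Embedded or early Hub-and-Spoke"    # Early Production
    if models_in_prod < 10:
        return "Hub-and-Spoke"                      # Scaling
    return "Platform + Embedded Specialists"        # At Scale


for count in (1, 4, 8, 12):
    print(count, "->", recommend_topology(count))
```

Note that the input is the number of models in production today, not the number on the roadmap, which is the point of picking based on where you are.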
The reorg trigger: You need to change topologies when the queue for AI team time exceeds 4-6 weeks, when product teams start building shadow AI infrastructure, when model quality varies wildly across teams, or when your best AI engineers quit because they feel isolated from peers. Any one of these signals means your current topology has outlived its usefulness.
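The four signals above are easy to track as a recurring health check. Here is a minimal sketch, assuming you already collect these observations somewhere. The `TopologySignals` class, its field names, and the default threshold of 4 weeks (the low end of the 4-6 week range) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TopologySignals:
    """Hypothetical snapshot of the four reorg signals."""
    queue_weeks: float                 # current wait for AI team time
    shadow_infra_found: bool           # product teams building their own AI infra
    quality_varies_across_teams: bool  # model quality differs wildly by team
    isolation_attrition: bool          # AI engineers leaving because they feel isolated


def reorg_triggers(signals: TopologySignals,
                   queue_threshold_weeks: float = 4.0) -> List[str]:
    """Return the signals that have fired. Any non-empty result is a trigger."""
    fired = []
    if signals.queue_weeks > queue_threshold_weeks:
        fired.append("queue_too_long")
    if signals.shadow_infra_found:
        fired.append("shadow_infrastructure")
    if signals.quality_varies_across_teams:
        fired.append("inconsistent_model_quality")
    if signals.isolation_attrition:
        fired.append("isolation_attrition")
    return fired


current = TopologySignals(queue_weeks=7, shadow_infra_found=True,
                          quality_varies_across_teams=False,
                          isolation_attrition=False)
print(reorg_triggers(current))
```

The function returns a list rather than a single boolean so the result says which signal fired, since each one points at a different failure mode.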
ROLE TAXONOMY
The roles that actually exist in 2026
Job titles in AI are still chaotic. The same work might be titled "ML Engineer" at one company and "AI Engineer" at another, and the same title can cover completely different work. The distinctions matter for org design because each role has different infrastructure needs, team placement, and career ladders.
ML Engineer
Focus: Model development
Trains models from scratch or fine-tunes foundation models. Works with training pipelines, feature engineering, hyperparameter optimization, and offline evaluation. Needs GPU clusters, experiment tracking (Weights & Biases, MLflow), and large datasets. Reports to: ML team lead or Head of AI. Typically sits in the platform or centralized team.
AI Engineer
Focus: Model integration
Integrates pre-trained models (especially LLMs) into production systems. Builds RAG architectures, agent frameworks, prompt pipelines, and evaluation harnesses. Needs API access, orchestration frameworks (LangChain, LlamaIndex, custom), and production deployment tooling. Reports to: product team EM or AI team lead. Typically embedded in product teams.
ML Platform Engineer
Focus: ML infrastructure
Builds and operates the shared ML platform: model serving (Triton, vLLM, TGI), feature stores, model registries, training orchestration, monitoring, and cost management. Pure infrastructure role with ML domain knowledge. Reports to: platform team lead. Lives in the platform team. Does not build models.
AI Product Manager
Focus: AI product strategy
Different from regular PMs because AI products have non-deterministic behavior, require different evaluation methods (not just A/B tests), and need ongoing monitoring post-launch. Must understand model capabilities and limitations well enough to set realistic expectations with stakeholders. Partners with AI engineers on evaluation criteria and failure-mode documentation.
Prompt Engineer
Focus: LLM behavior design
A contested role. In 2024, companies hired dedicated prompt engineers. By 2026, most organizations treat prompting as a skill that AI engineers own rather than a standalone position. Where it persists as a role: companies with dozens of LLM-powered features that need systematic prompt management, versioning, and A/B testing across a shared prompt registry. Otherwise, it is folded into the AI engineer scope.
AI Safety / Red Team
Focus: Model security and alignment
Tests AI systems for failure modes: prompt injection, harmful outputs, bias, data leakage, jailbreaks. Reports to: CISO (if security-first), Head of AI (if product-first), or operates as an independent function with a direct line to leadership. Should NOT report to the same team that built the system being tested. See our AI red teaming guide for implementation details.
The distinction that drives org design: ML Engineers and ML Platform Engineers belong in centralized or platform teams because their work is horizontal (serving multiple product surfaces). AI Engineers belong in product teams because their work is vertical (shipping one feature end-to-end). AI Product Managers go wherever the AI engineers are. Red Team is independent by definition. Getting this wrong means you either starve product teams of the people they need (by centralizing AI engineers who should be embedded) or you fragment infrastructure expertise (by embedding platform engineers who should be centralized).
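The placement rule can be written down as a small lookup and used to audit a roster. This is a sketch under stated assumptions: the `Placement` enum, the `ROLE_PLACEMENT` table, and the `misplaced` helper are hypothetical names, and the roster entries are made up. AI Product Managers are mapped to product teams here only because that is where this rule puts AI engineers. Prompt Engineer is omitted because the rule does not place it separately.

```python
from enum import Enum
from typing import List, Tuple


class Placement(Enum):
    CENTRAL_OR_PLATFORM = "centralized or platform team"  # horizontal work
    PRODUCT_TEAM = "product team"                         # vertical work
    INDEPENDENT = "independent function"


# Expected home for each role, following the rule in the paragraph above.
ROLE_PLACEMENT = {
    "ML Engineer": Placement.CENTRAL_OR_PLATFORM,
    "ML Platform Engineer": Placement.CENTRAL_OR_PLATFORM,
    "AI Engineer": Placement.PRODUCT_TEAM,
    "AI Product Manager": Placement.PRODUCT_TEAM,  # follows the AI engineers
    "AI Safety / Red Team": Placement.INDEPENDENT,
}


def misplaced(roster: List[Tuple[str, str, Placement]]) -> List[str]:
    """Return names of people whose actual placement differs from the expected one.

    roster entries are (name, role, actual_placement). Unknown roles are skipped.
    """
    return [
        name
        for name, role, actual in roster
        if role in ROLE_PLACEMENT and ROLE_PLACEMENT[role] is not actual
    ]


roster = [
    ("Ana", "AI Engineer", Placement.CENTRAL_OR_PLATFORM),   # should be embedded
    ("Ben", "ML Platform Engineer", Placement.PRODUCT_TEAM), # should be centralized
    ("Chen", "ML Engineer", Placement.CENTRAL_OR_PLATFORM),
]
print(misplaced(roster))
```

In this made-up roster, the first entry is an AI engineer who has been centralized and the second is a platform engineer who has been embedded, which are the two mistakes the paragraph warns about.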
CONWAY'S LAW
Your AI architecture already mirrors your org chart
Melvin Conway observed in 1967 that organizations design systems mirroring their communication structures. In AI, this law is especially punishing because the feedback loop stays invisible until production.
If the AI team is isolated from product: your AI features will feel bolted on. The model will be technically impressive but poorly integrated into the user workflow. The handoff between "model output" and "product experience" will be a JSON blob thrown over a wall.
If the AI team is fragmented across product teams: your models won't share infrastructure. You'll have three different vector databases, two incompatible evaluation frameworks, and zero institutional knowledge about what works. Each team reinvents lessons the other already learned.
If the AI team reports exclusively to engineering: your AI features will be technically excellent but product-blind. They'll optimize for model performance metrics (F1 score, latency) rather than user outcomes (task completion, time saved). The system will be over-architected for problems users don't have.
If the AI team reports exclusively to product: your AI infrastructure will be underinvested. Each feature ships with custom glue code, no shared evaluation pipeline, no model monitoring, no cost tracking. The "AI debt" compounds until a major incident forces a platform investment.
The topology you choose is the architecture you'll get. Pick the one that produces the architecture you want, not the one that's easiest to hire into.
Read the full analysis: Conway's Law in the Age of AI Teams
IMPLEMENTATION
The 90-day topology transition
Reorgs fail when they're announced as big-bang changes. Topology transitions work when they're executed as a series of small, reversible moves with clear success criteria at each step.
Week 1-2: Audit the current state. Map where every AI engineer actually spends their time (not their job description, their calendar). Identify the shadow processes: who do product teams actually go to when they need AI help? Where is infrastructure being duplicated? Where are people waiting in a queue?
Week 3-4: Define the target topology. Use the maturity framework above. Name the teams, the reporting lines, and the interaction modes. Write down what "success" looks like in 90 days (measurable: queue time, model deployment frequency, infrastructure consolidation).
Week 5-8: Execute the transition. Move one team at a time. Start with the team that has the clearest value case for the new topology. Don't move everyone simultaneously. Each move creates a proof point that makes the next move easier to justify.
Week 9-12: Stabilize and measure. Are queue times shorter? Is infrastructure consolidation happening? Are embedded engineers still connected to their peers? Is the platform team shipping things product teams actually use? Adjust based on data, not opinions.
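The stabilize-and-measure step is easier to run if the success criteria from weeks 3-4 are captured as numbers at the start. Below is a minimal sketch that compares a baseline snapshot with a day-90 snapshot on the three measures named above: queue time, model deployment frequency, and infrastructure consolidation. The `TopologySnapshot` class, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TopologySnapshot:
    """Hypothetical measurements taken during the audit and again at day 90."""
    queue_weeks: float           # typical wait for AI team time
    deploys_per_month: float     # model deployment frequency
    duplicate_infra_stacks: int  # e.g. parallel vector DBs or evaluation frameworks


def transition_report(baseline: TopologySnapshot,
                      day_90: TopologySnapshot) -> Dict[str, bool]:
    """Answer the week 9-12 questions with data: did each measure improve?"""
    return {
        "queue_time_shorter": day_90.queue_weeks < baseline.queue_weeks,
        "deploying_more_often": day_90.deploys_per_month > baseline.deploys_per_month,
        "infra_consolidating": day_90.duplicate_infra_stacks < baseline.duplicate_infra_stacks,
    }


baseline = TopologySnapshot(queue_weeks=8, deploys_per_month=1, duplicate_infra_stacks=4)
day_90 = TopologySnapshot(queue_weeks=3, deploys_per_month=4, duplicate_infra_stacks=4)
print(transition_report(baseline, day_90))
```

Two of the week 9-12 questions, whether embedded engineers stay connected to peers and whether product teams actually use the platform, are not captured here and would need their own measures, such as guild attendance or platform adoption per team.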
The hard part: Reporting line changes mean someone loses headcount. The VP who had 8 AI engineers on their team now has 3 (because 5 moved to the platform team). This is a political problem, not a technical one, and it requires executive sponsorship. If the CTO isn't willing to own the reorg, it won't stick.
ANTI-PATTERNS
Five AI team anti-patterns I've seen repeatedly
1. The "AI Center of Bottleneck." A centralized team that takes requests from everyone, prioritizes by squeakiest wheel, and delivers 3-6 months after the business needed the feature. Product teams route around it by hiring contractors or using no-code AI tools. The centralized team becomes irrelevant to the actual AI work happening in the company.
2. The "Lone Wolf" embedded engineer. One AI engineer in a product team of 12 backend/frontend engineers. They have no ML peers to review their work, no infrastructure support, and slowly become a full-stack engineer who occasionally does AI work. Their prompts are unversioned, their evaluation is manual, and when they leave, the AI feature breaks within two months.
3. The "Platform for Nobody." A platform team that builds elaborate ML infrastructure (Kubernetes operators, custom feature stores, bespoke model registries) that no product team is sophisticated enough to use. The platform team's roadmap is disconnected from actual product needs. Six months in, product teams are still deploying models via Jupyter notebooks and Docker containers because the platform's learning curve is too steep.
4. The "Research Lab." An AI team optimizing for paper-publishable results rather than production impact. They build state-of-the-art models that never ship because they require infrastructure that doesn't exist, data pipelines that aren't built, or latency budgets that aren't achievable. The team is evaluated on novelty, not production metrics.
5. The "Stealth AI Team." A product team that quietly accumulates AI capabilities without organizational awareness. They build an LLM-powered feature, then another, then another. By the time leadership notices, they have 5 models in production with no monitoring, no cost tracking, no security review, and no documentation. A prompt injection incident or a surprise $50K GPU bill triggers a panicked reorg.
EXPLORE THE AI TEAM DESIGN CLUSTER
Deep dives on AI organizational design
Conway's Law in the Age of AI Teams
Your AI system architecture will mirror your team structure. How to use this law deliberately instead of being punished by it.
AI Center of Excellence
The transitional CoE model: how to build it, when to dissolve it, and how to avoid the ivory-tower failure mode.
AI Team Org Chart
Reporting structures, role ladders, and actual org charts for AI teams at 20, 50, 200, and 500+ engineers.
AI Technical Debt
ML systems accrue debt differently than software. Model rot, pipeline debt, and the hidden cost of undocumented prompts.