The AI Compute Bottleneck
A CTO Framework for Capacity Planning in 2026
Anjney Midha sat down with TBPN on 2026-05-05 for a full interview titled “Fixing AI’s Biggest Bottleneck” and laid out a thesis the a16z infrastructure team has been pushing for two years: compute is the most constrained resource in AI, and most of the constraint is not where vendor decks place it. This page is the operational version: the thesis as a worksheet, with a utilization audit on the current fleet, reservation portfolio decisions, a multi-cloud hedge sized to the real risk, and a small set of build-vs-buy gates that turn the planning conversation into a decision rather than a debate.
30-SECOND EXECUTIVE TAKEAWAY
- The bottleneck has moved. In 2024 it was raw GPU supply. In 2026 it is power, networking topology, and scheduler maturity. The vendor pitch still emphasizes silicon; the spend that recovers utilization is in the integration layer.
- Utilization without HBM headroom is fragile. A 90 percent utilization number with 99 percent HBM occupancy cannot absorb context-window growth. Track HBM as a separate gate.
- Hedge the reservation, not the cloud. Multi-cloud is expensive theatre at most scales. A primary cloud plus a secondary capacity provider for burst is the cheaper hedge that actually works.
What Midha actually argued, and why it matters operationally
Midha’s thesis in the 2026-05-05 TBPN interview is that the binding constraint on AI is no longer the existence of GPUs but the rate at which they can be turned into useful tokens-per-second for a specific workload. Power delivery at the site, network topology between racks, kernel maturity for the model family, and scheduler fairness across teams all sit upstream of the GPU count and all are routinely under-funded in enterprise plans. AMP, the underlying a16z investment thesis, points at the same gap from a different angle: there is meaningful underutilized compute in the system, and the bottleneck is the software and operations layer that turns commodity GPU access into usable capacity.
At a CTO desk this translates into a planning question with a different shape than the one most teams are answering. The question is not “how many H100s do we reserve next quarter,” although that question still has to be answered. The question is what fraction of the capacity already in the building or already on contract is actually producing useful work, what the binding constraint is on the part that is not, and which capex or opex line closes the gap fastest. The remainder of this page is the worksheet for that conversation.
UTILIZATION AUDIT
Six metrics, all visible to finance
Run these on the current fleet before any capacity decision. Most enterprises in 2026 carry one or two of the six in their dashboards; the others surface only when an executive sponsor escalates. Visibility to finance is the gate that produces the funding for the software layer that actually closes the gap.
GPU-hour utilization
Average and p95 across the fleet, by workload class. Below 40 percent on a reserved fleet is a red flag for over-provisioning. Above 90 percent sustained is a queueing problem waiting to surface.
HBM headroom per workload
How much HBM each model actually uses at peak batch and context. Models near 100 percent HBM utilization cannot absorb context-window growth without re-deployment.
Tokens per second per dollar
The cost-of-inference number that matters. Track it per model and per workload, not just per GPU class. Vendor pricing shifts move this faster than annual planning.
Time to first token at p95
The customer-facing latency metric. p50 looks fine on most fleets; p95 is where the queue depth becomes visible. Tail latency is the SLA, not the average.
Scheduler fairness across teams
Whether one team’s workload is starving another. Most enterprises do not measure this and discover the problem when an executive sponsor escalates.
Reservation vs on-demand ratio
Dollar mix between committed and elastic capacity. Below 70 percent reservation on a stable workload is leaving discount on the table; above 90 percent removes the hedge against demand shifts.
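The thresholds above can be turned into a small check that runs against a dashboard export. The sketch below is one way to do that in Python, not a reference implementation: the `FleetSnapshot` fields and the flag names are hypothetical, the 95 percent HBM limit is an assumed stand-in for "near 100 percent", and mean utilization is used as a rough proxy for "sustained".

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class FleetSnapshot:
    """Audit inputs for one workload class (hypothetical field names)."""
    gpu_util_samples: list    # busy fraction per interval, each in 0..1
    hbm_peak_fraction: float  # peak HBM used / HBM available, in 0..1
    reserved_spend: float     # dollars on committed capacity
    on_demand_spend: float    # dollars on elastic capacity
    stable_workload: bool     # True if the workload is stable and persistent


def utilization_summary(samples):
    """Return (mean, p95) of the utilization samples."""
    # quantiles(n=20) yields 19 cut points; the last is the 95th percentile.
    return mean(samples), quantiles(samples, n=20)[-1]


def audit(snapshot, hbm_limit=0.95):
    """Return the list of audit flags raised by one snapshot."""
    flags = []

    avg_util, _p95 = utilization_summary(snapshot.gpu_util_samples)
    if avg_util < 0.40:
        flags.append("over-provisioned")   # below 40 percent on a reserved fleet
    elif avg_util > 0.90:
        flags.append("queueing-risk")      # above 90 percent sustained

    # Assumed limit: treat 95 percent or more HBM occupancy as "no headroom".
    if snapshot.hbm_peak_fraction >= hbm_limit:
        flags.append("no-hbm-headroom")

    total_spend = snapshot.reserved_spend + snapshot.on_demand_spend
    if total_spend > 0:
        reserved_ratio = snapshot.reserved_spend / total_spend
        if snapshot.stable_workload and reserved_ratio < 0.70:
            flags.append("under-reserved")    # discount left on the table
        elif reserved_ratio > 0.90:
            flags.append("no-demand-hedge")   # nothing elastic left to absorb shifts
    return flags
```

The sketch covers four of the six metrics. Tokens per second per dollar, p95 time to first token, and scheduler fairness need per-request and per-team data, so they are left out here.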
The reservation portfolio question
Reservation versus on-demand versus spot is the line that produces the largest single dollar savings or losses in an AI infrastructure plan. The savings on reservation are real, in the range of 30 to 60 percent against on-demand at most providers. The losses on bad reservations are also real, in the form of multi-year commitments to a GPU class that depreciated faster than expected. The right portfolio depends on the workload mix and the time horizon, and it changes each quarter as the model generation moves.
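The trade-off has a simple break-even that is worth writing down. A reservation is paid for whether or not it is used, so at a discount `d` it beats on-demand only when the workload uses more than `1 - d` of the committed hours. The sketch below assumes a flat on-demand rate and ignores usage above the commitment; the function names are illustrative.

```python
def break_even_usage(discount):
    """Fraction of committed hours that must be used for a reservation to pay off."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def reservation_saving(on_demand_rate, committed_hours, used_hours, discount):
    """Dollars saved by reserving instead of buying the used hours on demand.

    A negative result means the reservation lost money.
    Assumes used_hours <= committed_hours.
    """
    reserved_cost = on_demand_rate * (1.0 - discount) * committed_hours
    on_demand_cost = on_demand_rate * used_hours
    return on_demand_cost - reserved_cost
```

At the 30 percent end of the discount range the commitment has to be about 70 percent used to break even; at the 60 percent end, about 40 percent. A workload that may shrink or move sits on the wrong side of that line sooner than the headline discount suggests.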
The gating table below is the version of the decision that survives quarterly review. It maps the workload type to the recommended capacity model and states the reason. It is intentionally short. A reservation portfolio document longer than two pages is usually a document that is hiding the trade-off rather than naming it.
| Workload | Recommendation | Why |
|---|---|---|
| Workload is stable and persistent (more than 6 months runway) | Reserve | Discounts of 30 to 60 percent vs on-demand justify the lock-in. |
| Workload is exploratory or experimental | On-demand or spot | Reservation lock-in costs more than the on-demand premium if the workload moves or dies. |
| Training run, multi-day, large-cluster | Reserve dedicated capacity | Preemption cost on spot exceeds the discount; checkpoint recovery is not free. |
| Batch inference, offline evaluation | Spot or interruptible | Preemption is tolerable; savings are 50 to 80 percent. |
| Real-time customer-facing inference | Reserve with autoscale headroom | SLA exposure dominates the savings calculus. |
| Demand is uncertain or seasonal | Mix reserved + on-demand | Reservation covers baseline; on-demand absorbs spikes. |
| Generation-sensitive workload (next-gen GPU expected in 6 to 12 months) | Short-term reservation only | Multi-year lock-in on current generation strands capex when the next class ships. |
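For the mixed row, where reservation covers the baseline and on-demand absorbs spikes, the baseline can be sized from a demand history instead of by feel. The following is a minimal sketch under stated assumptions: `demand` is a hypothetical list of GPUs needed per period, pricing is a flat on-demand rate with a single reservation discount, and spot capacity is ignored.

```python
def mixed_cost(demand, reserved, rate, discount):
    """Total cost of reserving `reserved` GPUs and buying overflow on demand."""
    # The reservation is paid in every period, used or not.
    committed = reserved * len(demand) * rate * (1.0 - discount)
    # Anything above the reserved level is bought at the on-demand rate.
    overflow = sum(max(0, d - reserved) for d in demand) * rate
    return committed + overflow


def best_baseline(demand, rate, discount):
    """Reserved level that minimizes cost over the observed demand.

    The cost is piecewise linear in the reserved level, so it is enough
    to test zero and each distinct demand value.
    """
    candidates = sorted(set(demand) | {0})
    return min(candidates, key=lambda r: mixed_cost(demand, r, rate, discount))
```

With nine periods at 10 GPUs and one spike to 40, a 40 percent discount puts the cheapest baseline at 10 reserved GPUs, with the spike bought on demand. Reserving for the peak would cost more than twice as much in this toy case.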
The multi-cloud hedge, sized to the real risk
Multi-cloud as a hedge against provider failure or pricing surprise is one of those infrastructure positions that looks responsible on a slide and costs more than it saves in operation. The egress cost between providers is the line item that gets cited first, and it is real, but the larger cost is the team capability tax: building, monitoring, and securing two cloud footprints costs roughly 1.6 to 1.8x the capability cost of one. That is less than 2x because some tooling is shared, but it is never 1x. For most enterprises the cheaper hedge is a primary cloud plus a secondary capacity provider for burst and reservation diversification.
The secondary provider for burst is usually one of the specialist GPU clouds: CoreWeave, Lambda, Crusoe, or a regional equivalent. They price aggressively for capacity, they understand the workload class, and they will sell short-term reservations that the hyperscalers will not. The structure is: reserve baseline at the primary cloud at 12-month terms, run experimental and burst at the secondary provider on shorter terms, keep both contracts on the books, and re-evaluate the mix every quarter as the GPU generation moves.
Power, the constraint that does not show up until it does
Most enterprise plans treat power as a property of the data center contract, which it is until the moment the contracted power density turns out to be insufficient for the GPU class. H100 and beyond run hot enough that older data center halls cannot deliver the contracted density without retrofits, and the retrofit timeline is long enough that the GPUs arrive before the power does. Plan against measured power delivery at the site, not against contracted density, and ask the question before the order is placed.
At hyperscaler scale, power has become the binding constraint on capacity expansion. At mid-market enterprise scale, it is now the binding constraint on which sites can host the next generation. A capacity plan that does not state the power assumption per site, the contract duration, and the alternate-site fallback is a plan that will surface a power surprise during deployment, which is the most expensive moment to surface it.
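The contracted-versus-measured gap is easy to quantify once the site numbers are in hand. The sketch below is illustrative only: the per-server draw, the rack figures, and the 10 percent safety headroom are assumed inputs that should be replaced with vendor specifications and site measurements.

```python
import math


def servers_supported(kw_per_rack, racks, server_kw, headroom=0.10):
    """Number of GPU servers a site can power, keeping a safety headroom."""
    usable_kw = kw_per_rack * (1.0 - headroom)
    per_rack = math.floor(usable_kw / server_kw)
    return per_rack * racks


def power_shortfall(contracted_kw, measured_kw, racks, server_kw, headroom=0.10):
    """Servers the plan assumed but the measured power delivery cannot host."""
    planned = servers_supported(contracted_kw, racks, server_kw, headroom)
    actual = servers_supported(measured_kw, racks, server_kw, headroom)
    return planned - actual
```

As a usage example with assumed figures: a hall contracted at 24 kW per rack that measures 17 kW per rack, hosting servers that draw about 10.2 kW each, fits one server per rack instead of two. Across 10 racks, `power_shortfall(24, 17, 10, 10.2)` reports 10 servers with nowhere to go.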
Scheduler maturity, the line item that gets cut and should not
The cheapest dollar of utilization recovery in most enterprise AI fleets is the scheduler. Kubernetes plus a default scheduler will not produce fair allocation across teams or efficient packing of mixed-precision workloads. The mature options (Volcano, Run:ai, now part of NVIDIA, Slurm for HPC-style fleets, and Apache YuniKorn for fairness) all require setup investment and ongoing operation, and they all pay back in measurable utilization within one to two quarters when properly tuned.
The reason scheduler work gets cut from plans is that it does not photograph well. A 12 percent relative utilization improvement from scheduler tuning has the same effect on capacity as a 12 percent increase in GPU count, and costs roughly an order of magnitude less. In many enterprises the capex case for next quarter is a scheduler case rather than a procurement case; the difficulty is that it is harder to defend in a board meeting because the line item is a software-and-people investment rather than a piece of named hardware.
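The equivalence follows from treating useful capacity as GPU count times utilization. A short sketch, assuming that simple model and using hypothetical function names, makes the board-meeting arithmetic explicit:

```python
def effective_gpus(gpu_count, utilization):
    """Useful capacity under the simple model: GPUs times busy fraction."""
    return gpu_count * utilization


def equivalent_gpu_purchase(gpu_count, util_before, util_after):
    """Extra GPUs needed at the old utilization to match the tuned fleet."""
    if util_before <= 0:
        raise ValueError("util_before must be positive")
    return gpu_count * (util_after / util_before - 1.0)
```

On a 1,000-GPU fleet, moving utilization from 50 to 56 percent is a 12 percent relative gain and matches buying about 120 GPUs. A gain of 12 percentage points, from 50 to 62 percent, matches about 240, so it is worth stating which of the two a tuning estimate means.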
The build-vs-buy gates for the capacity layer
Three questions decide the build-vs-buy posture on the capacity layer itself. The first: is your workload mix dominated by training or inference? Training-heavy mixes justify dedicated capacity and serious scheduler investment; inference-heavy mixes lean toward managed services and autoscaling. The second: do you have in-house ML systems engineering capability? Without it, the build option becomes ongoing technical debt and managed services win the TCO comparison even at higher unit cost. The third: is the workload differentiated enough that the capacity layer is a competitive advantage? For most enterprises the answer is no, and that is the gate that points at buying managed inference and reserving capacity rather than building a platform.
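The three gates can be written as a small decision function so the posture is recorded with its reasons. This is one reading of the gates, not the only one: it assumes the second and third questions decide build versus buy, while the first shapes what is built or bought. The function name and return strings are hypothetical.

```python
def capacity_posture(training_heavy, has_ml_systems_team, capacity_is_differentiator):
    """Return (posture, notes) for the capacity layer from the three gates."""
    notes = []

    # Gate 1: the workload mix shapes the emphasis, not the posture itself.
    if training_heavy:
        notes.append("training-heavy: dedicated capacity and scheduler investment")
    else:
        notes.append("inference-heavy: managed services and autoscaling")

    # Gate 2: without in-house ML systems engineering, build becomes debt.
    if not has_ml_systems_team:
        notes.append("no ML systems capability: managed services win on TCO")
        return "buy", notes

    # Gate 3: build only if the capacity layer is a competitive advantage.
    if not capacity_is_differentiator:
        notes.append("capacity layer is not a differentiator: buy and reserve")
        return "buy", notes

    return "build", notes
```

For example, `capacity_posture(False, True, False)` returns `"buy"` with the inference-heavy and not-a-differentiator notes, which is the outcome the third gate predicts for most enterprises.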
The cross-link here is to the talent side of the same question. The build option for the capacity layer requires ML systems engineers with rare combinations of skills, and the compensation reflects it. The breakdown by mission lives at the AI engineer salary by mission guide on the sibling site, and the Nvidia compute angle on equity sits at the AI engineer equity and RSU breakdown. The capacity plan and the talent plan are the same plan from two angles.