The AI Compute Bottleneck

A CTO Framework for Capacity Planning in 2026

Anjney Midha sat down with TBPN on 2026-05-05 for a full interview titled “Fixing AI’s Biggest Bottleneck” and argued a thesis the a16z infrastructure team has been pushing for two years: compute is the most constrained resource in AI, and most of the constraint is not where vendor decks place it. This page is the operational version, the thesis turned into a worksheet: a utilization audit on the current fleet, reservation portfolio decisions, a multi-cloud hedge sized to the real risk, and a small set of build-vs-buy gates that turn the planning conversation into a decision rather than a debate.

30-SECOND EXECUTIVE TAKEAWAY

  • The bottleneck has moved. In 2024 it was raw GPU supply. In 2026 it is power, networking topology, and scheduler maturity. The vendor pitch still emphasizes silicon; the spend that recovers utilization is in the integration layer.
  • Utilization without HBM headroom is fragile. A 90 percent utilization number with 99 percent HBM occupancy cannot absorb context-window growth. Track HBM as a separate gate.
  • Hedge the reservation, not the cloud. Multi-cloud is expensive theatre at most scales. A primary cloud plus a secondary capacity provider for burst is the cheaper hedge that actually works.

What Midha actually argued, and why it matters operationally

Midha’s thesis in the 2026-05-05 TBPN interview is that the binding constraint on AI is no longer the existence of GPUs but the rate at which they can be turned into useful tokens-per-second for a specific workload. Power delivery at the site, network topology between racks, kernel maturity for the model family, and scheduler fairness across teams all sit upstream of the GPU count and all are routinely under-funded in enterprise plans. AMP, the underlying a16z investment thesis, points at the same gap from a different angle: there is meaningful underutilized compute in the system, and the bottleneck is the software and operations layer that turns commodity GPU access into usable capacity.

At a CTO desk this translates into a planning question with a different shape than the one most teams are answering. The question is not “how many H100s do we reserve next quarter,” although that question still has to be answered. The question is what fraction of the capacity already in the building or already on contract is actually producing useful work, what the binding constraint is on the part that is not, and which capex or opex line closes the gap fastest. The remainder of this page is the worksheet for that conversation.

UTILIZATION AUDIT

Six metrics, all visible to finance

Run these on the current fleet before any capacity decision. Most enterprises in 2026 carry one or two of the six in their dashboards; the others surface only when an executive sponsor escalates. Visibility to finance is the gate that produces the funding for the software layer that actually closes the gap.

GPU-hour utilization

Average and p95 across the fleet, by workload class. Below 40 percent on a reserved fleet is a red flag for over-provisioning. Above 90 percent sustained is a queueing problem waiting to surface.
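
As a rough illustration of the thresholds above, the sketch below computes mean and p95 utilization for one workload class and applies the 40 and 90 percent gates. The function name, the sample format (one utilization fraction per hour), and the use of the mean as a stand-in for "sustained" are assumptions made for the example, not part of the worksheet itself.

```python
from statistics import mean, quantiles


def utilization_flags(samples, low=0.40, high=0.90):
    """Summarize GPU utilization for one workload class.

    samples: utilization fractions in [0, 1], e.g. one reading per hour
             (hypothetical input format; needs at least two readings).
    """
    avg = mean(samples)
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = quantiles(samples, n=20, method="inclusive")[-1]
    return {
        "mean": avg,
        "p95": p95,
        "over_provisioned": avg < low,   # red flag on a reserved fleet
        "queueing_risk": avg > high,     # sustained load above the high gate
    }


# Illustrative readings for a lightly used reserved pool.
report = utilization_flags([0.20, 0.30, 0.35, 0.25, 0.30])
print(report["over_provisioned"], report["queueing_risk"])
```

Running the same function per workload class, rather than once for the whole fleet, keeps a busy training pool from masking an idle inference pool.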

HBM headroom per workload

How much HBM each model actually uses at peak batch and context. Models near 100 percent HBM utilization cannot absorb context-window growth without re-deployment.
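
To see why context growth eats HBM headroom, a back-of-the-envelope estimate helps. The sketch below uses the usual key-value cache size estimate for a transformer decoder (two tensors per layer, sized by KV heads, head dimension, sequence length, and batch). The model shape, weight size, and replica size are hypothetical, and the estimate ignores activations, fragmentation, and runtime overhead, so real occupancy would be somewhat higher.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Approximate KV cache size in bytes for one serving replica."""
    # The factor of 2 covers the separate key and value tensors in each layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch


def hbm_occupancy(weights_bytes, kv_bytes, hbm_bytes):
    """Fraction of HBM taken by weights plus KV cache (overheads ignored)."""
    return (weights_bytes + kv_bytes) / hbm_bytes


GB = 1e9
weights = 140 * GB          # hypothetical: 70e9 parameters at 2 bytes each
replica_hbm = 4 * 80 * GB   # hypothetical replica: 4 GPUs with 80 GB of HBM each

for ctx in (8192, 32768):
    kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=ctx, batch=16)
    print(ctx, round(hbm_occupancy(weights, kv, replica_hbm), 3))
```

Under these assumptions, quadrupling the context window at the same batch size moves occupancy from roughly 57 percent to roughly 97 percent, which is the situation the HBM gate is meant to catch before it happens.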

Tokens per second per dollar

The cost-of-inference number that matters. Track it per model and per workload, not just per GPU class. Vendor pricing shifts move this faster than annual planning.
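
One way to compute the number, sketched under the assumption that the inputs are measured throughput on the real workload and the all-in hourly cost of the capacity serving it. The function names and figures are illustrative. The cost per million tokens is included because it is often the form finance asks for.

```python
def tokens_per_second_per_dollar(tokens_per_second, dollars_per_hour):
    """Measured throughput divided by the hourly cost of the capacity serving it."""
    return tokens_per_second / dollars_per_hour


def cost_per_million_tokens(tokens_per_second, dollars_per_hour):
    """Same inputs, expressed as dollars per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


# Hypothetical measurement: 2,500 tokens/s on capacity costing $4.00 per hour.
print(tokens_per_second_per_dollar(2500, 4.00))        # 625.0
print(round(cost_per_million_tokens(2500, 4.00), 2))   # 0.44
```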

Time to first token at p95

The customer-facing latency metric. p50 looks fine on most fleets; p95 is where the queue depth becomes visible. Tail latency is the SLA, not the average.
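
A small sketch of why the median hides the problem. The latency samples and the 500 ms SLA are made up for illustration: 18 of 20 requests are fast and 2 sit in a queue.

```python
from statistics import median, quantiles


def ttft_summary(latencies_ms, sla_ms):
    """Compare median and p95 time to first token against an SLA threshold."""
    p50 = median(latencies_ms)
    # Last of the 19 cut points from n=20 is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20, method="inclusive")[-1]
    return {"p50": p50, "p95": p95, "meets_sla": p95 <= sla_ms}


# Hypothetical sample: 18 requests at 200 ms, 2 queued requests at 2,000 ms.
samples = [200] * 18 + [2000] * 2
print(ttft_summary(samples, sla_ms=500))
```

Here the median stays at 200 ms while the p95 lands at 2,000 ms, so a dashboard showing only the median would report a healthy fleet that is missing its SLA.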

Scheduler fairness across teams

Whether one team’s workload is starving another. Most enterprises do not measure this and discover the problem when an executive sponsor escalates.
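
The text does not name a fairness metric. One common choice, used here purely as an illustration, is Jain's fairness index: 1.0 means every team received the same share, and the value falls toward 1/n as one team takes everything. The input is assumed to be GPU-hours received per team over a window, ideally normalized by each team's quota.

```python
def jain_fairness(allocations):
    """Jain's fairness index over per-team allocations (1.0 = perfectly even)."""
    n = len(allocations)
    squares = sum(x * x for x in allocations)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero allocation")
    total = sum(allocations)
    return total * total / (n * squares)


# Hypothetical GPU-hours received by four teams with equal quotas.
print(jain_fairness([100, 100, 100, 100]))          # 1.0
print(round(jain_fairness([370, 10, 10, 10]), 2))   # 0.29
```

Tracking the index weekly gives a single trend line, so starvation shows up in a dashboard before it shows up as an escalation.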

Reservation vs on-demand ratio

Dollar mix between committed and elastic capacity. Below 70 percent reservation on a stable workload is leaving discount on the table; above 90 percent removes the hedge against demand shifts.

The reservation portfolio question

Reservation versus on-demand versus spot is the line that produces the largest single dollar savings or losses in an AI infrastructure plan. The savings on reservation are real, in the range of 30 to 60 percent against on-demand at most providers. The losses on bad reservations are also real, in the form of multi-year commitments to a GPU class that depreciated faster than expected. The right portfolio depends on the workload mix and the time horizon, and it changes each quarter as the model generation moves.
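
The trade-off can be reduced to a break-even calculation. The sketch below assumes a simple model: a reservation is paid for every committed hour at the discounted rate, while on-demand is paid only for hours actually used. Rates, hours, and function names are illustrative.

```python
def break_even_utilization(discount):
    """Fraction of committed hours that must be used for a reservation to beat on-demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def reservation_savings(on_demand_rate, discount, hours, utilization):
    """Dollars saved by reserving instead of paying on-demand (negative means a loss)."""
    reserved_cost = on_demand_rate * (1 - discount) * hours   # paid whether used or not
    on_demand_cost = on_demand_rate * hours * utilization     # paid only for used hours
    return on_demand_cost - reserved_cost


# Hypothetical: $4.00 per GPU-hour on-demand, 40 percent reservation discount, 1,000 hours.
print(round(reservation_savings(4.00, 0.40, 1000, utilization=0.80)))   # 800
print(round(reservation_savings(4.00, 0.40, 1000, utilization=0.50)))   # -400
```

At a 30 percent discount the reservation only pays off above roughly 70 percent use of the committed hours; at 60 percent the bar drops to roughly 40 percent. That is why the same discount is a saving on a stable workload and a loss on an exploratory one.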

The gating table below is the version of the decision that survives quarterly review. It maps the workload type to the recommended capacity model and states the reason. It is intentionally short. A reservation portfolio document longer than two pages is usually a document that is hiding the trade-off rather than naming it.

| Workload | Recommendation | Why |
| --- | --- | --- |
| Workload is stable and persistent (more than 6 months of runway) | Reserve | Discounts of 30 to 60 percent vs on-demand justify the lock-in. |
| Workload is exploratory or experimental | On-demand or spot | Reservation lock-in costs more than the on-demand premium if the workload moves or dies. |
| Training run, multi-day, large-cluster | Reserve dedicated capacity | Preemption cost on spot exceeds the discount; checkpoint recovery is not free. |
| Batch inference, offline evaluation | Spot or interruptible | Preemption is tolerable; savings are 50 to 80 percent. |
| Real-time customer-facing inference | Reserve with autoscale headroom | SLA exposure dominates the savings calculus. |
| Demand is uncertain or seasonal | Mix reserved + on-demand | Reservation covers baseline; on-demand absorbs spikes. |
| Generation-sensitive workload (next-gen GPU expected in 6 to 12 months) | Short-term reservation only | Multi-year lock-in on current generation strands capex when the next class ships. |
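
For teams that want the gating table in a form a planning script can query, a minimal sketch follows. The recommendations come straight from the table; the dictionary keys and function name are hypothetical labels chosen for the example.

```python
# Workload class -> recommended capacity model, as listed in the gating table.
# The keys are hypothetical identifiers, not an established taxonomy.
CAPACITY_GATES = {
    "stable_persistent": "Reserve",
    "exploratory": "On-demand or spot",
    "large_training_run": "Reserve dedicated capacity",
    "batch_inference": "Spot or interruptible",
    "realtime_inference": "Reserve with autoscale headroom",
    "uncertain_demand": "Mix reserved + on-demand",
    "generation_sensitive": "Short-term reservation only",
}


def recommend_capacity(workload_class):
    """Look up the capacity model for a workload class, failing loudly on unknown input."""
    try:
        return CAPACITY_GATES[workload_class]
    except KeyError:
        raise ValueError(f"unclassified workload: {workload_class!r}") from None


print(recommend_capacity("batch_inference"))   # Spot or interruptible
```

Raising on an unknown class is deliberate: a workload nobody has classified is a planning gap, not something to default silently to on-demand.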

The multi-cloud hedge, sized to the real risk

Multi-cloud as a hedge against provider failure or pricing surprise is one of those infrastructure positions that looks responsible on a slide and costs more than it saves in operation. The egress cost between providers is the line item that gets cited first, and it is real, but the larger cost is the team capability tax. Building, monitoring, and securing two cloud footprints costs roughly 1.6 to 1.8x the capability cost of one: shared tooling keeps it below 2x, but it never falls to 1x. For most enterprises the cheaper hedge is a primary cloud plus a secondary capacity provider for burst and reservation diversification.

The secondary provider for burst is usually one of the specialist GPU clouds: CoreWeave, Lambda, Crusoe, or a regional equivalent. They price aggressively for capacity, they understand the workload class, and they will sell short-term reservations that the hyperscalers will not. The structure is: reserve baseline at the primary cloud at 12-month terms, run experimental and burst at the secondary provider on shorter terms, keep both contracts on the books, and re-evaluate the mix every quarter as the GPU generation moves.

Power, the constraint that does not show up until it does

Most enterprise plans treat power as a property of the data center contract, which it is until the moment the contracted power density turns out to be insufficient for the GPU class. H100-class and newer GPUs draw enough power per rack that older data center halls cannot deliver the contracted density without retrofits, and the retrofit timeline is long enough that the GPUs arrive before the power does. Plan against measured power delivery at the site, not against contracted density, and ask the question before the order is placed.
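
The gap between contracted and measured density translates directly into rack count. The sketch below is a simplified check with made-up numbers: the per-node draw, the rack figures, and the 10 percent safety margin are all assumptions, and a real plan would also account for cooling and networking gear.

```python
import math


def nodes_per_rack(deliverable_kw, node_kw, safety_margin=0.10):
    """Whole GPU nodes that fit in one rack's power budget after a safety margin."""
    usable_kw = deliverable_kw * (1 - safety_margin)
    return math.floor(usable_kw / node_kw)


def racks_needed(total_nodes, deliverable_kw, node_kw, safety_margin=0.10):
    """Racks required to host total_nodes at the given per-rack power delivery."""
    per_rack = nodes_per_rack(deliverable_kw, node_kw, safety_margin)
    if per_rack == 0:
        raise ValueError("rack cannot power even one node")
    return math.ceil(total_nodes / per_rack)


# Hypothetical order: 32 nodes drawing 10 kW each.
print(racks_needed(32, deliverable_kw=40, node_kw=10))   # contracted density: 11 racks
print(racks_needed(32, deliverable_kw=25, node_kw=10))   # measured delivery: 16 racks
```

In this example the same order needs 11 racks on paper and 16 in practice, which is the kind of surprise that is far cheaper to find before the purchase order than during deployment.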

At hyperscaler scale, power has become the binding constraint on capacity expansion. At mid-market enterprise scale, it is now the binding constraint on which sites can host the next generation. A capacity plan that does not state the power assumption per site, the contract duration, and the alternate-site fallback is a plan that will surface a power surprise during deployment, which is the most expensive moment to surface it.

Scheduler maturity, the line item that gets cut and should not

The cheapest dollar of utilization recovery in most enterprise AI fleets is the scheduler. Kubernetes with its default scheduler will not produce fair allocation across teams or efficient packing of mixed-precision workloads. The mature options (Volcano, Run:ai now under NVIDIA, Slurm for HPC-style fleets, Apache YuniKorn for fairness) all require setup investment and ongoing operation, and they all pay back in measurable utilization within one to two quarters when properly tuned.

The reason scheduler work gets cut from plans is that it does not photograph well. A 12 percent utilization improvement from scheduler tuning has the same effect on capacity as a 12 percent increase in GPU count, and costs roughly an order of magnitude less. The capex case for next quarter is in many enterprises a scheduler case rather than a procurement case; the difficulty is that the case is harder to defend in a board meeting because the line item is a software-and-people investment rather than a piece of named hardware.
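
The equivalence is simple arithmetic if useful capacity is modeled as GPU count times utilization, and the 12 percent is read as a relative improvement (for example from 50 to 56 percent). Both of those are assumptions made for this sketch, as are the fleet size and function name.

```python
def equivalent_extra_gpus(gpu_count, util_before, util_after):
    """GPUs that would have to be bought to match a utilization gain at the old utilization."""
    if util_before <= 0:
        raise ValueError("util_before must be positive")
    # Useful capacity is modeled as gpu_count * utilization.
    return gpu_count * util_after / util_before - gpu_count


# Hypothetical fleet: 1,000 GPUs moving from 50 to 56 percent utilization.
print(round(equivalent_extra_gpus(1000, 0.50, 0.56)))   # 120
```

Pricing those 120 equivalent GPUs at the fleet's own reservation rate gives a concrete ceiling for the scheduler budget, which tends to be an easier number to defend than a utilization percentage.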

The build-vs-buy gates for the capacity layer

Three questions decide the build-vs-buy posture on the capacity layer itself. The first: is your workload mix dominated by training or inference? Training-heavy mixes justify dedicated capacity and serious scheduler investment; inference-heavy mixes lean toward managed services and autoscaling. The second: do you have in-house ML systems engineering capability? Without it, the build option becomes ongoing technical debt and managed services win the TCO comparison even at higher unit cost. The third: is the workload differentiated enough that the capacity layer is a competitive advantage? For most enterprises the answer is no, and that is the gate that points at buying managed inference and reserving capacity rather than building a platform.
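
The three gates can be written down as a small decision function. The text gives the questions but not how to combine them, so the ordering below (capability first, differentiation second, workload mix last) and the two-way "build" or "buy" outcome are one possible reading, not a rule taken from the interview.

```python
from dataclasses import dataclass


@dataclass
class CapacityPosture:
    training_heavy: bool              # gate 1: is the mix dominated by training?
    has_ml_systems_team: bool         # gate 2: in-house ML systems engineering?
    capacity_is_differentiator: bool  # gate 3: is the capacity layer a competitive edge?


def build_or_buy(posture):
    """Return "build" or "buy" for the capacity layer (assumed gate ordering)."""
    if not posture.has_ml_systems_team:
        return "buy"   # building without the team becomes ongoing technical debt
    if not posture.capacity_is_differentiator:
        return "buy"   # the common case: managed inference plus reserved capacity
    if posture.training_heavy:
        return "build"  # dedicated capacity and scheduler investment are justified
    return "buy"       # inference-heavy mixes lean toward managed services


print(build_or_buy(CapacityPosture(True, True, False)))   # buy
print(build_or_buy(CapacityPosture(True, True, True)))    # build
```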

The cross-link here is to the talent side of the same question. The build option for the capacity layer requires ML systems engineers with rare combinations of skills, and the compensation reflects it. The breakdown by mission lives at the AI engineer salary by mission guide on the sibling site, and the Nvidia compute angle on equity sits at the AI engineer equity and RSU breakdown. The capacity plan and the talent plan are the same plan from two angles.

AI Compute Bottleneck Planning: Frequently Asked Questions

What is GPU allocation in an enterprise AI context?
GPU allocation is the process of assigning fractional or whole GPU capacity to specific workloads, teams, or models across a fleet. In practice it spans three layers: reservation (multi-month committed capacity, on-prem or cloud), scheduling (which job gets which GPU at which moment, usually via Kubernetes plus a scheduler like Volcano or Run:ai, or via Slurm), and accounting (chargeback per team or model so utilization is visible to finance). A capacity plan that addresses only one of the three layers will leak utilization at the other two.
Do GPUs use SRAM or DRAM?
Both. Modern training and inference GPUs (H100, H200, B200, MI300X) carry on-chip SRAM in the form of register file, shared memory, and L1/L2 caches, alongside high-bandwidth memory (HBM3, HBM3e) which is a DRAM variant. The HBM is what most discussions mean by "GPU memory" and is the binding constraint on model size and context length for inference. Capacity planning has to track HBM headroom per workload, not just GPU-hour throughput.
Which GPU has 10,000 CUDA cores?
A handful of recent NVIDIA GPUs cross that line. The H100 SXM5 has 16,896 CUDA cores; the H200 the same; the consumer RTX 4090 has 16,384. CUDA core count is a poor planning proxy on its own because tensor-core throughput and HBM bandwidth dominate large-model inference and training cost. Plan against measured tokens-per-second-per-dollar on the actual workload, not against vendor spec sheets.
Is 90 percent GPU utilization bad?
For an inference fleet, 90 percent sustained utilization is usually a sign of either healthy capacity use or a queueing problem about to surface. For training, sustained 90 percent is generally good. The number alone is not informative. The right metrics are tokens-per-second-per-dollar, time-to-first-token at p95, and queue depth during peak windows. A 90 percent number with bad p95 latency means the fleet is undersized for the workload mix; a 90 percent number with healthy p95 means the capacity is right.
What is the AI compute bottleneck in 2026?
Three constraints stack. First, power: the limit at the data-center site, not the GPU. Second, networking: NVLink, NVSwitch, and InfiniBand topology determine the size of the model and the batch that fits. Third, software: kernel maturity and scheduler fairness, which is the layer most enterprises under-invest in. Vendor narratives still emphasize raw GPU supply, but the bottleneck for most enterprises is the integration layer, not the silicon. Anjney Midha at a16z made this case in his full TBPN interview on 2026-05-05.
How do I decide between reserved capacity and spot for AI workloads?
Three rules. Training runs that span days to weeks should sit on reserved capacity because preemption cost dominates spot savings. Batch inference and offline evaluation can run on spot or interruptible capacity at meaningful discount. Real-time customer-facing inference should sit on reserved capacity with autoscaling headroom; spot is too volatile for SLA-bound traffic. In all three cases, hedge the reservation horizon: at the current model-generation cadence, a term of 12 months or less is usually right.
Should we run AI on a single cloud or multi-cloud?
For most enterprises, multi-cloud for AI workloads costs more in egress, integration, and team capability than it returns in savings or hedge value. The right hedge is usually a primary cloud plus a secondary capacity provider (CoreWeave, Lambda, Crusoe) for burst and for reservation diversification, plus an on-prem option for the largest training workloads if scale justifies it. Pure cloud-portability is expensive theatre at most enterprise scales.
What goes in a compute capacity plan for 2026?
Six sections. Workload inventory and forecast (training and inference, by team and model). Utilization audit on the current fleet, with HBM headroom and queue depth visible. Reservation portfolio (provider, term, GPU class, dollar exposure). Software and scheduler maturity status. Power and contract risk. Kill-switch triggers per workload class. The plan is reviewed quarterly because the inputs move faster than annual planning cycles can absorb.
Thomas Prommer Technology Executive — CTO/CIO/CTAIO

These salary reports are built on firsthand hiring experience across 20+ years of engineering leadership (adidas, $9B platform, 500+ engineers) and a proprietary network of 200+ executive recruiters and headhunters who share placement data with us directly. As a top-1% expert on institutional investor networks, I've conducted 200+ technical due diligence consultations for PE/VC firms including Blackstone, Bain Capital, and Berenberg — work that requires current, accurate compensation benchmarks across every seniority level. Our team cross-references recruiter data with BLS statistics, job board salary disclosures, and executive compensation surveys to produce ranges you can actually negotiate with.
