The math behind how LLMs are trained and served – Reiner Pope
TL;DR
Reiner Pope explains the mathematical mechanics behind LLM inference costs, demonstrating how 'Fast Mode' APIs charge premiums for smaller batch sizes that reduce latency, and why physical memory bandwidth constraints create hard limits on how fast or cheap inference can get regardless of budget.
🚀 The 'Fast Mode' Premium
Premium pricing buys lower batch sizes
Fast Mode charges 6x more for 2.5x speed because serving fewer concurrent users reduces wait times but eliminates the cost amortization benefits of large batches.
Memory bandwidth sets a hard latency floor
Even with infinite budget, latency cannot drop below the time required to fetch all model weights from memory, preventing arbitrary speedups through pricing tiers.
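This floor can be sketched in a few lines. The model size, precision, and bandwidth below are illustrative assumptions, not figures from the talk:

```python
def latency_floor_ms(n_params: float, bytes_per_param: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token decode latency: every weight must be
    streamed from memory at least once per decoding step, so no batch
    size or price tier can beat weight_bytes / bandwidth."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s * 1e3

# Hypothetical example: a 70B-parameter dense model served in FP8
# (1 byte/param) on an accelerator with 8 TB/s of HBM bandwidth.
print(latency_floor_ms(70e9, 1.0, 8e12))  # -> 8.75 ms per token
```

Whatever the batch size or the price paid, a decode step on this hypothetical setup cannot finish faster than ~8.75 ms.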
'Slow Mode' savings hit a compute floor
Delaying requests to increase batch sizes reduces cost only up to a point, after which per-token compute and KV cache fetching costs dominate and cannot be amortized further.
⚡ Roofline Analysis & Hardware Physics
Inference time is the max of compute or memory
Latency is determined by whichever is slower—performing matrix multiplications (compute bound) or fetching weights and KV cache (memory bound)—creating distinct operational regimes.
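The max-of-two-times roofline can be written down directly. All the numbers fed in below are hypothetical, chosen only to show the two regimes:

```python
def decode_step_time(batch_size, flops_per_token, peak_flops_per_s,
                     weight_bytes, kv_bytes_per_seq, bandwidth_bytes_per_s):
    """Roofline model of one decode step: the step takes as long as the
    slower of doing the math or moving the bytes (weights + KV cache)."""
    compute_s = batch_size * flops_per_token / peak_flops_per_s
    memory_s = (weight_bytes + batch_size * kv_bytes_per_seq) / bandwidth_bytes_per_s
    regime = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), regime

# Hypothetical dense 70B model (2 FLOPs/param/token), FP8 weights,
# 2e15 FLOP/s peak, 8 TB/s bandwidth, KV cache ignored for simplicity.
print(decode_step_time(1,   1.4e11, 2e15, 70e9, 0, 8e12))   # memory bound
print(decode_step_time(200, 1.4e11, 2e15, 70e9, 0, 8e12))   # compute bound
```

At batch 1 the step is dominated by streaming the 70 GB of weights; by batch 200 the matrix multiplies take longer than the memory traffic and the system flips regimes.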
The 300:1 FLOPs-to-bandwidth ratio constrains design
Modern accelerators like NVIDIA's Blackwell provide roughly 300 FP4 FLOPs for every byte per second of memory bandwidth, a 300:1 ratio that determines the minimum batch size needed to reach compute efficiency.
Context length linearly increases memory pressure
For standard dense attention, KV cache fetching grows linearly with sequence length, potentially shifting the system from compute-bound to memory-bound and drastically reducing Model FLOPs Utilization.
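The linear growth is easy to see from the cache's shape. The layer counts and head dimensions below are a hypothetical configuration, not any specific model:

```python
def kv_cache_bytes_per_seq(n_layers, n_kv_heads, head_dim, seq_len,
                           bytes_per_elt):
    """Dense attention stores one K and one V vector per layer per token,
    so the cache (and the bytes fetched on every decode step) grows
    linearly with seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Hypothetical config: 80 layers, 8 KV heads of dim 128, FP8 cache,
# a 32K-token context.
print(kv_cache_bytes_per_seq(80, 8, 128, 32_768, 1) / 1e9)  # ~5.37 GB/sequence
```

At a few gigabytes per sequence, a large batch of long-context requests can easily fetch more KV cache than model weights per step, which is how long contexts push the system back into the memory-bound regime.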
💰 Cost Optimization & The Batch Size Sweet Spot
Cost curves follow inverse scaling
Cost per token decreases hyperbolically with batch size as weight fetching costs are amortized across more sequences, asymptotically approaching the irreducible compute cost floor.
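The hyperbolic shape follows from splitting a fixed per-step cost across the batch. The cost constants below are arbitrary illustrative units:

```python
def cost_per_token(batch_size, weight_fetch_cost, compute_cost_per_token):
    """Weight fetching is paid once per decode step and amortized across
    the batch; compute is paid per token and is the asymptotic floor."""
    return weight_fetch_cost / batch_size + compute_cost_per_token

# Illustrative units: weight fetch costs 1.0 per step, compute 0.01 per token.
for b in (1, 10, 100, 1000):
    print(b, cost_per_token(b, 1.0, 0.01))
# 1.01 -> 0.11 -> 0.02 -> 0.011: falling toward, but never below, 0.01
```

Doubling the batch from 1 to 2 halves the amortized weight cost, but past a few hundred sequences the curve flattens onto the irreducible compute floor, which is why "Slow Mode" savings eventually run out.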
Optimal batch sizes are 2,000–3,000 sequences
To fully amortize memory bandwidth costs and reach compute-bound operation, systems must batch approximately 300 times the model's sparsity ratio: typically 2,000 to 3,000 concurrent sequences for frontier models.
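A back-of-envelope sketch of that threshold, under simplified assumptions (2 FLOPs per active parameter per token, every total parameter streamed once per step, KV cache ignored); the 32x sparsity figure is a hypothetical MoE, not a named model:

```python
def critical_batch_size(flops_per_byte, sparsity, bytes_per_param):
    """Smallest batch at which matmul time catches up with weight-streaming
    time. Setting B * 2 * active_params / peak_flops equal to
    total_params * bytes / bandwidth and solving for B gives
    B* = (peak_flops / bandwidth) * (total/active) * bytes / 2."""
    return flops_per_byte * sparsity * bytes_per_param / 2

# Hypothetical: a 300:1 FP4 FLOPs-per-byte accelerator, a MoE with 32x
# total-to-active sparsity, FP4 weights (0.5 bytes/param).
print(critical_batch_size(300, 32, 0.5))  # -> 2400.0
```

Under these assumed constants the threshold lands in the 2,000–3,000 range the talk cites; different precisions or sparsity ratios shift it proportionally.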
Sparse attention alters the scaling laws
Sparse attention architectures like DeepSeek's reduce KV cache fetching to scale with the square root of context length rather than linearly, fundamentally changing the latency trade-offs for long-context inference.
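The payoff of sqrt-scaling grows with context length. The sketch below is a stylized comparison only; the constant in the sparse variant is made up for illustration and is not DeepSeek's actual design:

```python
import math

def dense_kv_tokens_fetched(seq_len):
    # Dense attention reads the entire cache every step: O(L)
    return seq_len

def sparse_kv_tokens_fetched(seq_len, c=16):
    # Stylized sqrt-scaling sparse attention: O(c * sqrt(L)).
    # The constant c is an illustrative assumption.
    return c * math.sqrt(seq_len)

for L in (1_024, 65_536, 1_048_576):
    print(L, dense_kv_tokens_fetched(L) / sparse_kv_tokens_fetched(L))
# advantage grows from 2x at 1K tokens to 64x at 1M tokens
```

Because the dense/sparse ratio itself grows like sqrt(L), the longer the context, the more sparse attention delays the flip into the memory-bound regime.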
Bottom Line
The fundamental trade-off in LLM serving is that lower latency requires smaller batches, which raises cost per token; and this relationship is bounded by physical memory bandwidth limits that no budget can overcome.