The math behind how LLMs are trained and served – Reiner Pope
TL;DR
Reiner Pope explains the mathematical mechanics behind LLM inference costs, demonstrating how 'Fast Mode' APIs charge premiums for smaller batch sizes that reduce latency, and why physical memory bandwidth constraints create hard limits on how fast or cheap inference can get regardless of budget.
🚀 The 'Fast Mode' Premium
Premium pricing buys lower batch sizes
Fast Mode charges 6x more for 2.5x speed because serving fewer concurrent users reduces wait times but eliminates the cost amortization benefits of large batches.
Memory bandwidth sets a hard latency floor
Even with infinite budget, latency cannot drop below the time required to fetch all model weights from memory, preventing arbitrary speedups through pricing tiers.
'Slow Mode' savings hit a compute floor
Delaying requests to increase batch sizes reduces cost only up to a point, after which per-token compute and KV cache fetching costs dominate and cannot be amortized further.
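As a rough sketch of the latency floor described above, the snippet below divides the bytes of weights read per decoding step by the memory bandwidth, and also computes the price paid per unit of speedup from the 6x and 2.5x figures. The function name and the example hardware numbers (400 GB of weights, 8 TB/s of aggregate bandwidth) are illustrative assumptions, not figures from the talk.

```python
def weight_fetch_floor_s(weight_bytes: float, mem_bw_bytes_per_s: float) -> float:
    """Lower bound on per-step latency: every weight byte must be read once.

    Batch size does not appear here, which is why paying for a smaller
    batch cannot push latency below this value.
    """
    return weight_bytes / mem_bw_bytes_per_s


# Hypothetical deployment, chosen only for illustration.
WEIGHT_BYTES = 400e9   # assumed: 400 GB of weights resident in memory
MEM_BW = 8e12          # assumed: 8 TB/s aggregate memory bandwidth

floor_s = weight_fetch_floor_s(WEIGHT_BYTES, MEM_BW)
print(f"latency floor per step: {floor_s * 1e3:.0f} ms")

# Figures quoted in the summary: 6x the price for 2.5x the speed.
price_multiple = 6.0
speed_multiple = 2.5
price_per_unit_speedup = price_multiple / speed_multiple
print(f"price paid per unit of speedup: {price_per_unit_speedup:.1f}x")
```

With these placeholder numbers, no pricing tier could deliver a decoding step faster than the printed floor, because the floor depends only on model size and hardware bandwidth.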
⚡ Roofline Analysis & Hardware Physics
Inference time is the max of compute time and memory time
Latency per step is set by whichever is slower: performing the matrix multiplications (compute-bound) or fetching weights and KV cache from memory (memory-bound). This creates two distinct operating regimes.
The 300:1 FLOPs-to-bandwidth ratio constrains design
Modern GPUs like Blackwell have a ratio of roughly 300:1 between compute throughput (FLOPs, in FP4) and memory bandwidth, which determines the minimum batch size needed to keep the compute units busy.
Context length linearly increases memory pressure
For standard dense attention, KV cache fetching grows linearly with sequence length, potentially shifting the system from compute-bound to memory-bound and drastically reducing Model FLOPs Utilization.
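The roofline picture above can be sketched in a few lines: step time is the larger of compute time and memory time, and the bytes moved grow linearly with context length through the KV cache. All constants below are hypothetical placeholders picked only to give a 300:1 compute-to-bandwidth ratio; they are not specifications of any real GPU or model.

```python
def roofline_step(flops: float, bytes_moved: float,
                  peak_flops: float, mem_bw: float) -> tuple[float, str]:
    """Return (step time in seconds, which resource is the bottleneck)."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / mem_bw
    if compute_s >= memory_s:
        return compute_s, "compute"
    return memory_s, "memory"


def decode_bytes(weight_bytes: float, batch: int,
                 context_len: int, kv_bytes_per_token: float) -> float:
    """Bytes read per decoding step: weights once, plus each sequence's KV cache."""
    return weight_bytes + batch * context_len * kv_bytes_per_token


# Hypothetical hardware and model (assumptions for illustration only).
PEAK_FLOPS = 3e15          # FLOP/s
MEM_BW = 1e13              # bytes/s, so PEAK_FLOPS / MEM_BW = 300
WEIGHT_BYTES = 1e11        # bytes of weights read per step
FLOPS_PER_TOKEN = 1e11     # FLOPs to generate one token for one sequence
KV_BYTES_PER_TOKEN = 1e5   # KV cache bytes stored per context token
BATCH = 1000

for context_len in (1_000, 10_000):
    flops = BATCH * FLOPS_PER_TOKEN
    moved = decode_bytes(WEIGHT_BYTES, BATCH, context_len, KV_BYTES_PER_TOKEN)
    step_s, bound = roofline_step(flops, moved, PEAK_FLOPS, MEM_BW)
    # Fraction of peak compute actually used during the step.
    utilization = (flops / PEAK_FLOPS) / step_s
    print(f"context={context_len}: {step_s * 1e3:.1f} ms, "
          f"{bound}-bound, utilization={utilization:.0%}")
```

Under these assumptions the shorter context is compute-bound, while the tenfold longer context becomes memory-bound and utilization drops, matching the shift described above.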
💰 Cost Optimization & The Batch Size Sweet Spot
Cost curves follow inverse scaling
Cost per token decreases hyperbolically with batch size as weight fetching costs are amortized across more sequences, asymptotically approaching the irreducible compute cost floor.
Optimal batch sizes are 2,000–3,000 sequences
To fully amortize weight fetches and reach compute-bound operation, the batch size must be roughly 300 times the model's sparsity ratio, which works out to about 2,000 to 3,000 concurrent sequences for frontier models.
Sparse attention alters the scaling laws
Sparse attention architectures like DeepSeek's reduce KV cache fetching to scale with the square root of context length rather than linearly, fundamentally changing the latency trade-offs for long-context inference.
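The three relationships in this section can be illustrated with a small model. It is a sketch under stated assumptions: the critical batch size uses the rule of thumb quoted above (300 times the sparsity ratio), the sparsity ratio of 8 is an assumed example value that lands inside the 2,000 to 3,000 range, cost is normalized so that the compute floor equals 1.0, and KV cache traffic is ignored in the cost curve. The function names are hypothetical.

```python
import math


def critical_batch(compute_to_bandwidth_ratio: float, sparsity_ratio: float) -> float:
    """Batch size at which weight fetching stops being the bottleneck."""
    return compute_to_bandwidth_ratio * sparsity_ratio


def relative_cost_per_token(batch: float, crit: float) -> float:
    """Cost per token relative to the compute floor (floor = 1.0).

    Below the critical batch, weight-fetch time is shared by `batch`
    sequences, so cost falls as 1/batch. Above it, compute dominates
    and adding more sequences no longer lowers cost.
    """
    return max(1.0, crit / batch)


def kv_tokens_fetched(context_len: int, sparse: bool) -> float:
    """Context tokens whose KV entries are read per generated token."""
    return math.sqrt(context_len) if sparse else float(context_len)


crit = critical_batch(300, 8)  # sparsity ratio of 8 is an assumption
for batch in (240, 1200, 2400, 4800):
    print(f"batch={batch}: relative cost {relative_cost_per_token(batch, crit):.1f}")

context_len = 1_000_000
print("dense KV reads: ", kv_tokens_fetched(context_len, sparse=False))
print("sparse KV reads:", kv_tokens_fetched(context_len, sparse=True))
```

The cost column flattens at 1.0 once the batch reaches the critical size, and the square-root rule shrinks KV reads at a million tokens of context by a factor of a thousand relative to dense attention.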
Bottom Line
The fundamental trade-off in LLM serving is that lower latency requires smaller batches, which raise the cost per token. Memory bandwidth bounds this trade-off: no budget buys latency below the time needed to fetch the model weights.