Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 10: Inference
Inference now dominates AI economics, with OpenAI generating 8.6 trillion tokens daily—exceeding frontier model training compute in under four days. Unlike training, autoregressive inference cannot parallelize across sequences, making it fundamentally memory-bandwidth bound rather than compute bound, with batch sizes under 295 on H100s failing to saturate GPU capacity.