Stanford CS336 Language Modeling from Scratch | Spring 2026 | Guest Lecture: Dan Fu
TL;DR
Dan Fu explains how LLM inference serves as the engine converting electricity into intelligence, detailing the lifecycle of requests through modern serving systems and emphasizing that GPU kernel expertise enables full-stack ML innovation.
📈 The Scale Revolution
Dramatic parameter growth across generations
Language models grew from roughly 100 million parameters in 2018 to more than 1 trillion today, and the lecture puts the largest frontier models at 5-10 trillion parameters.
Rapid societal transition to AI adoption
AI adoption mirrors the 1902-1912 shift from horses to cars in Manhattan, with 2024 marking the inflection point at which developers began writing the majority of their code with AI assistants.
⚙️ Inference Architecture
Inference as the intelligence engine
While GPUs are the 'new oil,' inference engines and GPU kernels serve as the actual mechanism converting electricity into usable tokens and intelligence.
Complex end-to-end request lifecycle
Each request undergoes scheduling, KV cache lookup for computation reuse, distributed execution across GPUs, and safety checks before token output.
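The KV cache lookup step can be illustrated with a toy prefix cache. This is a minimal sketch under stated assumptions, not the design of any particular serving system: the class name `PrefixCache`, the block size `BLOCK`, and the string placeholders standing in for real key/value tensors are all hypothetical. It shows only the idea that a new request can skip recomputing the longest block-aligned prefix it shares with earlier requests.

```python
from typing import Dict, List, Tuple

BLOCK = 4  # hypothetical cache block size, in tokens


class PrefixCache:
    """Toy prefix cache: maps block-aligned token prefixes to KV placeholders."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], str] = {}

    def insert(self, tokens: List[int]) -> None:
        # Register every full, block-aligned prefix of this request.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = tuple(tokens[:end])
            # A real engine would store KV tensors; a label stands in here.
            self._blocks.setdefault(key, f"kv[{end - BLOCK}:{end}]")

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        # Walk block by block and stop at the first miss.
        matched = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if tuple(tokens[:end]) in self._blocks:
                matched = end
            else:
                break
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])        # e.g. a shared system prompt
hit = cache.longest_cached_prefix([1, 2, 3, 4, 5, 6, 7, 8, 99, 100])
print(f"tokens reused from cache: {hit}")
```

In this sketch only the tokens after the matched prefix would need a fresh prefill pass, which is the computation reuse the summary refers to.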
Prefill versus decode phase bottlenecks
Prefill processes input tokens in parallel and is compute-bound, while decode generates outputs sequentially and is memory bandwidth-bound.
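A back-of-the-envelope roofline estimate makes the prefill/decode distinction concrete. The sketch below is an illustration under assumptions, not a measurement: it models a single square `d_model x d_model` matrix multiply in 2-byte precision, and the accelerator figures (`PEAK_FLOPS`, `PEAK_BANDWIDTH`) are hypothetical round numbers rather than the specs of any real GPU. Arithmetic intensity is FLOPs per byte moved; a workload whose intensity falls below the hardware ratio of peak FLOPs to peak bandwidth is limited by memory traffic.

```python
def arithmetic_intensity(num_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for one d_model x d_model matmul applied to num_tokens tokens."""
    flops = 2 * num_tokens * d_model * d_model              # multiply + add per weight
    weight_bytes = d_model * d_model * bytes_per_value       # weights are read once
    activation_bytes = 2 * num_tokens * d_model * bytes_per_value  # read input, write output
    return flops / (weight_bytes + activation_bytes)


# Hypothetical accelerator: round numbers chosen only for illustration.
PEAK_FLOPS = 1e15       # FLOP/s
PEAK_BANDWIDTH = 3e12   # bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity needed to saturate compute


def regime(num_tokens: int, d_model: int) -> str:
    ai = arithmetic_intensity(num_tokens, d_model)
    return "compute-bound" if ai >= RIDGE else "memory-bound"


d_model = 4096
print("prefill, 4096 tokens:", regime(4096, d_model))  # many tokens share one weight read
print("decode, 1 token:     ", regime(1, d_model))     # weights re-read for a single token
```

Under these assumptions, prefill amortizes each weight read over thousands of tokens, while single-sequence decode does about one FLOP per byte and is held back by memory bandwidth. Batching many decode requests together raises the intensity, which is one motivation for the batching strategies used in production.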
📊 Production Workloads
Diverse real-world workload characteristics
Coding agents process tens of thousands of input tokens with brief outputs, while chat and summarization tasks exhibit radically different computational patterns.
Continuous batching maximizes GPU utilization
Modern systems employ continuous batching to dynamically mix requests of varying lengths, processing short and long generations simultaneously to maximize throughput.
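A small simulation shows why continuous batching helps when output lengths vary. This is a simplified sketch, assuming one decode step produces one token for every running request and ignoring prefill cost, memory limits, and scheduling policy; the function names and the example lengths are hypothetical. Static batching waits for the longest request in each batch, whereas continuous batching admits a waiting request as soon as a slot frees up.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step emits one token per running request
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps


output_lengths = [2, 10, 3, 9]  # tokens to generate per request (illustrative)
print("static:    ", static_batching_steps(output_lengths, batch_size=2))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=2))
```

In this toy setup the short requests no longer hold a slot idle while waiting for a long neighbor to finish, so the same work completes in fewer decode steps.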
Navigating complex agentic workflow patterns
Multi-turn conversations with variable gaps between interactions, tool use iterations, and diverse latency requirements create traffic patterns distinct from training workloads.
Bottom Line
Mastering GPU kernels and inference engine architecture enables full-stack innovation in machine learning algorithms and optimizes the conversion of compute into intelligence.