Stanford CS336 Language Modeling from Scratch | Spring 2026 | Guest Lecture: Dan Fu

Stanford Online

Podcasts | June 05, 2026 | 13 thousand views | 1:11:41

TL;DR

Dan Fu explains how LLM inference serves as the engine converting electricity into intelligence, detailing the lifecycle of requests through modern serving systems and emphasizing that GPU kernel expertise enables full-stack ML innovation.

📈 The Scale Revolution 2 insights

Dramatic parameter growth across generations

Language models scaled from 100 million parameters in 2018 to over 1 trillion today, with frontier models reaching 5-10 trillion parameters.

Rapid societal transition to AI adoption

AI adoption mirrors the 1902-1912 shift from horses to cars in Manhattan, with 2024 marking the inflection point where developers began writing the majority of their code with AI assistants.

⚙️ Inference Architecture 3 insights

Inference as the intelligence engine

While GPUs are the 'new oil,' inference engines and GPU kernels serve as the actual mechanism converting electricity into usable tokens and intelligence.

Complex end-to-end request lifecycle

Each request undergoes scheduling, KV cache lookup for computation reuse, distributed execution across GPUs, and safety checks before token output.
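As a rough illustration of the KV cache lookup step, the sketch below matches an incoming prompt against previously seen token prefixes, block by block, and reports how many leading tokens could skip recomputation. The names `PrefixCache`, `block_keys`, and `BLOCK_SIZE` are hypothetical and not taken from the lecture or any particular serving system; production engines typically key blocks by a chained hash rather than by the full prefix tuple used here for clarity.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical number of tokens per cache block


def block_keys(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[Tuple[int, ...]]:
    """Return one key per full block; each key is the whole prefix up to that block."""
    n_full = len(tokens) // block_size
    return [tuple(tokens[: (i + 1) * block_size]) for i in range(n_full)]


class PrefixCache:
    """Toy prefix cache: maps a token prefix to a placeholder KV block handle."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, ...], str] = {}

    def insert(self, tokens: List[int]) -> None:
        # Register every full block of a finished request for later reuse.
        for key in block_keys(tokens):
            self._blocks.setdefault(key, f"kv-block-{len(self._blocks)}")

    def cached_prefix_len(self, tokens: List[int]) -> int:
        # Walk blocks in order and stop at the first miss, because a block's
        # keys and values are only valid if every earlier token also matches.
        reused = 0
        for key in block_keys(tokens):
            if key not in self._blocks:
                break
            reused = len(key)
        return reused


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # e.g. a shared system prompt
new_prompt = [1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102]
print(cache.cached_prefix_len(new_prompt))  # tokens whose KV entries can be reused
```

Only the tokens after the reused prefix would need prefill in this toy model, which is the computation reuse the lookup is meant to provide.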

Prefill versus decode phase bottlenecks

Prefill processes input tokens in parallel and is compute-bound, while decode generates outputs sequentially and is memory bandwidth-bound.
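One way to see why the two phases hit different limits is a back-of-the-envelope roofline estimate for a single square weight matrix. The sketch below is a simplification under stated assumptions: it ignores attention and KV cache traffic, assumes 2-byte values, and uses hypothetical hardware figures (1e15 FLOP/s peak compute, 3e12 bytes/s memory bandwidth) that are illustrative rather than taken from the lecture.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for multiplying n_tokens activations by a d_model x d_model weight."""
    flops = 2 * n_tokens * d_model * d_model               # one multiply-add per weight per token
    weight_bytes = bytes_per_value * d_model * d_model      # weights are read once per pass
    activation_bytes = bytes_per_value * 2 * n_tokens * d_model  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)


def bound(n_tokens: int, d_model: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a pass as compute-bound or memory-bound against the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte at which both limits are equal
    if arithmetic_intensity(n_tokens, d_model) >= ridge:
        return "compute-bound"
    return "memory-bound"


PEAK_FLOPS = 1e15     # hypothetical accelerator, FLOP/s
MEM_BANDWIDTH = 3e12  # hypothetical accelerator, bytes/s

# Prefill: thousands of prompt tokens share one read of the weights.
print(bound(4096, 4096, PEAK_FLOPS, MEM_BANDWIDTH))
# Decode at batch size 1: one new token per read of the weights.
print(bound(1, 4096, PEAK_FLOPS, MEM_BANDWIDTH))
```

In this simplified model, decode performs roughly one FLOP per byte of weights read, so the memory system rather than the arithmetic units sets the pace; batching more sequences per step raises the intensity.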

📊 Production Workloads 3 insights

Diverse real-world workload characteristics

Coding agents process tens of thousands of input tokens with brief outputs, while chat and summarization tasks exhibit radically different computational patterns.

Continuous batching maximizes GPU utilization

Modern systems employ continuous batching to dynamically mix requests of varying lengths, processing short and long generations simultaneously to maximize throughput.
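The throughput benefit can be sketched with a toy simulation that counts decode steps under two policies. This is a minimal model, not a description of any specific engine's scheduler: the function names are hypothetical, each step is assumed to produce one token for every running request, prefill cost is ignored, and every request is assumed to generate at least one token.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; finished slots sit idle."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i : i + batch_size])
    return steps


def continuous_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """A freed slot is refilled from the waiting queue before the next decode step."""
    waiting = deque(output_lens)
    running: List[int] = []  # remaining tokens to generate per active request
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; drop the ones that are done.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lens = [2, 8, 2, 8]  # alternating short and long generations
print(static_batching_steps(lens, batch_size=2))      # 16
print(continuous_batching_steps(lens, batch_size=2))  # 12
```

With this mix, the short requests no longer hold a slot hostage until the long one in the same batch finishes, which is where the saved steps come from.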

Navigating complex agentic workflow patterns

Multi-turn conversations with variable gaps between interactions, tool use iterations, and diverse latency requirements create traffic patterns distinct from training workloads.

Bottom Line

Mastering GPU kernels and inference engine architecture enables full-stack innovation in machine learning algorithms and optimizes the conversion of compute into intelligence.
