Stanford CS25: Transformers United V6 I Serving Transformers: Lessons from the Trenches
TL;DR
Inference has emerged as the critical revenue-generating phase of AI. Engineers must treat serving as a full-stack discipline spanning applications to hardware, and precise workload definition is the foundation of profitable deployment.
💰 Inference as the Revenue Engine 3 insights
Training is a cost center, inference generates revenue
Organizations cannot monetize model weights directly and must convert them into served systems, making efficient inference the economic engine that funds continued model development.
Modern training pipelines depend on massive inference
Reinforcement learning and tool-use feedback loops now require generating model outputs at scale, potentially consuming more FLOPs than the pre-training phase itself.
Full-stack engineering from apps to electrons
Inference optimization uniquely spans application design, linear algebra, GPU kernel tuning, and physical hardware constraints including cooling and power delivery.
🎯 Three Application Archetypes 3 insights
Chatbot+: Strict human latency with tool integration
Interactive systems like Claude Code must respect human reaction times while using text outputs to trigger calls to external computer systems.
Background Agents: Async workflows with loose constraints
Systems like Devin and other coding agents operate on minute-to-hour timelines: humans submit tasks and return later, so these systems tolerate significantly higher latency.
Data Processors: Bursty batch processing jobs
Document indexing and extraction workloads tolerate high latency but feature extremely bursty traffic patterns with long gaps between intense write spikes.
📊 Defining Workload Metrics 3 insights
Quantify via QPS and non-deterministic token counts
Measure queries per second alongside the distributions of input and output token counts, recognizing that output length is determined by the model's own generation behavior rather than by user input.
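As a rough illustration of this kind of workload definition, the sketch below summarizes a request log into QPS plus percentiles of input and output token counts. The `Request` record, the `summarize` helper, and the nearest-rank percentile choice are all assumptions made for this example, not something taken from the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One logged request (hypothetical record layout)."""
    input_tokens: int
    output_tokens: int


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests, window_s):
    """Summarize requests observed over a window of window_s seconds."""
    inputs = [r.input_tokens for r in requests]
    outputs = [r.output_tokens for r in requests]
    return {
        "qps": len(requests) / window_s,
        "input_p50": percentile(inputs, 50),
        "input_p95": percentile(inputs, 95),
        # Output lengths are observed after the fact, not chosen by the caller.
        "output_p50": percentile(outputs, 50),
        "output_p95": percentile(outputs, 95),
    }


if __name__ == "__main__":
    log = [Request(100, 10), Request(200, 50), Request(300, 20), Request(400, 400)]
    print(summarize(log, window_s=2.0))
```

Reporting percentiles rather than a single mean keeps long-tail outputs visible, since they are the ones that drive worst-case latency.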
Leverage prefix reuse for aggressive KV caching
Caching previously computed KV entries for shared prompt prefixes converts GPU work into storage reads, delivering significant cost savings, particularly for latency-tolerant agent and batch workloads.
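The idea of prefix reuse can be sketched with a toy cache that tracks which block-aligned token prefixes have already been processed. This is a minimal illustration under stated assumptions: the `PrefixCache` class, its block size, and the use of whole-prefix tuples as keys are hypothetical simplifications. Production servers typically store actual KV tensors and key them by chained block hashes.

```python
class PrefixCache:
    """Toy prefix cache: records which block-aligned prefixes were seen.

    A real server would store KV tensors per block. Here we only track
    membership, to show how much prefill work a lookup would save.
    """

    def __init__(self, block_size=4):
        self.block_size = block_size
        self._seen = set()  # tuples of tokens, one per block-aligned prefix

    def _prefixes(self, tokens):
        # Only full blocks are cacheable; a trailing partial block is skipped.
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            yield tuple(tokens[:end])

    def lookup(self, tokens):
        """Return how many leading tokens are already covered by the cache."""
        hit = 0
        for prefix in self._prefixes(tokens):
            if prefix not in self._seen:
                break
            hit = len(prefix)
        return hit

    def insert(self, tokens):
        """Record every block-aligned prefix of tokens as cached."""
        for prefix in self._prefixes(tokens):
            self._seen.add(prefix)


if __name__ == "__main__":
    cache = PrefixCache(block_size=4)
    system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
    cache.insert(system_prompt + [9, 10])
    reused = cache.lookup(system_prompt + [99, 100])
    print(f"tokens served from cache: {reused}")
```

Tokens covered by a hit skip prefill computation and only need to be read back, which is why workloads with long shared prompts benefit most.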
Optimize per-replica latency budgets, not aggregate throughput
Engineer for Time to First Token and Time Per Output Token on a per-user, per-replica basis because aggregate metrics mask the constraints that determine hardware requirements.
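To make the two per-user metrics concrete, here is a small sketch that computes them from token arrival timestamps for a single request. It assumes the common definitions (TTFT as the delay from sending the request to the first token, TPOT as the average gap between subsequent tokens). The function names and the budget check are hypothetical helpers for this example.

```python
def ttft(request_sent_s, token_times_s):
    """Time to First Token: delay from sending the request to the first token."""
    return token_times_s[0] - request_sent_s


def tpot(token_times_s):
    """Time Per Output Token: mean gap between tokens after the first.

    Returns None when fewer than two tokens were produced.
    """
    if len(token_times_s) < 2:
        return None
    elapsed = token_times_s[-1] - token_times_s[0]
    return elapsed / (len(token_times_s) - 1)


def meets_budget(request_sent_s, token_times_s, ttft_budget_s, tpot_budget_s):
    """Check one request against per-user latency budgets (assumed values)."""
    per_token = tpot(token_times_s)
    ttft_ok = ttft(request_sent_s, token_times_s) <= ttft_budget_s
    tpot_ok = per_token is None or per_token <= tpot_budget_s
    return ttft_ok and tpot_ok


if __name__ == "__main__":
    times = [0.5, 0.75, 1.0, 1.25]  # seconds at which each token arrived
    print(ttft(0.0, times), tpot(times))
    print(meets_budget(0.0, times, ttft_budget_s=1.0, tpot_budget_s=0.3))
```

Evaluating these per request, rather than averaging tokens per second across a whole fleet, exposes the individual requests that miss their budget.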
Bottom Line
Define inference workloads by per-replica, per-user latency budgets (TTFT and TPOT) and token distributions tailored to your application archetype, rather than optimizing for aggregate throughput alone.