Stanford CS25: Transformers United V6 I Serving Transformers: Lessons from the Trenches

| Podcasts | June 04, 2026 | 708 views | 1:22:31

TL;DR

Inference has emerged as the critical revenue-generating phase of AI, requiring engineers to treat serving as a full-stack discipline spanning applications to hardware, with precise workload definition being the foundation of profitable deployment.

💰 Inference as the Revenue Engine

Training is a cost center, inference generates revenue

Organizations cannot monetize model weights directly and must convert them into served systems, making efficient inference the economic engine that funds continued model development.

Modern training pipelines depend on massive inference

Reinforcement learning and tool-use feedback loops now require generating model outputs at scale, potentially consuming more FLOPs than the pre-training phase itself.

Full-stack engineering from apps to electrons

Inference optimization uniquely spans application design, linear algebra, GPU kernel tuning, and physical hardware constraints including cooling and power delivery.

🎯 Three Application Archetypes

Chatbot+: Strict human latency with tool integration

Interactive systems like Claude Code must respect human reaction times while using text outputs to trigger calls to external computer systems.

Background Agents: Async workflows with loose constraints

Systems like Devin and other coding agents operate on minute-to-hour timelines: humans submit a task and return later, so these workloads tolerate significantly higher latency.

Data Processors: Bursty batch processing jobs

Document indexing and extraction workloads tolerate high latency but feature extremely bursty traffic patterns with long gaps between intense write spikes.
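One rough way to put a number on "bursty" is the ratio of the busiest time window to the average window. The sketch below is illustrative only: the function name `peak_to_mean`, the 60-second window, and the sample arrival times are all assumptions, not something taken from the talk.

```python
from collections import Counter


def peak_to_mean(arrival_times_s, window_s=60.0):
    """Ratio of the busiest window's request count to the mean window count.

    Windows with no traffic between the first and last arrival count as zero.
    """
    if not arrival_times_s:
        raise ValueError("need at least one arrival")
    start = min(arrival_times_s)
    # Bucket each arrival into a fixed-size window, counted from the first one.
    counts = Counter(int((t - start) // window_s) for t in arrival_times_s)
    n_windows = max(counts) + 1
    mean = len(arrival_times_s) / n_windows
    return max(counts.values()) / mean


# Made-up traffic: two short spikes an hour apart versus an even trickle.
bursty = [0.0 + i for i in range(30)] + [3600.0 + i for i in range(30)]
steady = [i * 60.0 for i in range(60)]

bursty_ratio = peak_to_mean(bursty)
steady_ratio = peak_to_mean(steady)
```

A ratio near 1 indicates even traffic. A large ratio means that capacity sized for the spike would sit mostly idle during the gaps.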

📊 Defining Workload Metrics

Quantify via QPS and non-deterministic token counts

Measure queries per second alongside the distributions of input and output token counts, recognizing that output length is determined by the model's own sampling statistics rather than by the user's input.
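As a minimal sketch of what such a measurement might look like, the code below summarizes a request log into average QPS and token-count percentiles. The `Request` record, its field names, and the sample log are hypothetical. Nearest-rank percentiles are used only for simplicity.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical log record; field names are illustrative assumptions.
    arrival_s: float      # arrival time in seconds
    input_tokens: int     # prompt length, known when the request arrives
    output_tokens: int    # generated length, known only after the model stops


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests):
    """Return average QPS plus median and p95 token counts for a request log."""
    arrivals = [r.arrival_s for r in requests]
    span_s = max(arrivals) - min(arrivals)
    inputs = [r.input_tokens for r in requests]
    outputs = [r.output_tokens for r in requests]
    return {
        "qps": len(requests) / span_s if span_s > 0 else float("nan"),
        "input_p50": percentile(inputs, 50),
        "input_p95": percentile(inputs, 95),
        "output_p50": percentile(outputs, 50),
        "output_p95": percentile(outputs, 95),
    }


# Made-up sample log, for illustration only.
log = [
    Request(0.0, 1200, 150),
    Request(1.0, 800, 40),
    Request(2.0, 4000, 900),
    Request(4.0, 1500, 300),
]
stats = summarize(log)
```

Reporting tail percentiles alongside the median matters here because the output length is not known in advance, so the long tail of generations tends to drive capacity planning more than the typical request does.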

Leverage prefix reuse for aggressive KV caching

Caching previous computations converts GPU workload into storage reads, delivering significant cost savings particularly for latency-tolerant agent and batch workloads.
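The idea of prefix reuse can be sketched with a toy cache that tracks which token prefixes have been seen before. This is not any particular serving engine's implementation: the block size, the chained-hash scheme, and the `PrefixCache` class are assumptions made for illustration, and the cache stores keys only rather than real KV tensors.

```python
import hashlib

BLOCK_TOKENS = 4  # illustrative block size; real systems choose their own


def block_keys(token_ids, block_tokens=BLOCK_TOKENS):
    """Hash each full block together with everything before it.

    Chaining the hashes means a key matches only if the entire prefix up to
    and including that block is identical.
    """
    keys = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_tokens
    for start in range(0, full, block_tokens):
        block = token_ids[start:start + block_tokens]
        running.update(",".join(map(str, block)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


class PrefixCache:
    """Toy stand-in for a KV cache: it stores keys only, not tensors."""

    def __init__(self):
        self._seen = set()

    def cached_prefix_tokens(self, token_ids):
        """Count leading tokens whose KV entries could be read from cache."""
        hits = 0
        for key in block_keys(token_ids):
            if key not in self._seen:
                break
            hits += 1
        return hits * BLOCK_TOKENS

    def insert(self, token_ids):
        self._seen.update(block_keys(token_ids))


cache = PrefixCache()
system_prompt = list(range(100, 108))          # 8 shared tokens (2 blocks)
first = system_prompt + [1, 2, 3, 4, 5]
second = system_prompt + [9, 9, 9, 9, 9]

cold = cache.cached_prefix_tokens(first)       # nothing cached yet
cache.insert(first)
warm = cache.cached_prefix_tokens(second)      # shared system prompt is reused
```

In this toy run the second request finds its 8-token shared prefix already cached, so only the tokens after the prefix would need fresh computation.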

Optimize per-replica latency budgets, not aggregate throughput

Engineer for Time to First Token and Time Per Output Token on a per-user, per-replica basis because aggregate metrics mask the constraints that determine hardware requirements.
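A minimal sketch of how TTFT and TPOT might be computed for a single streamed response and checked against a per-user budget follows. The `LatencyBudget` type, the helper names, and every number are placeholders chosen for illustration, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    # Hypothetical per-user targets, in seconds; the numbers used below are
    # placeholders, not values from the talk.
    ttft_s: float
    tpot_s: float


def measure(request_sent_s, token_times_s):
    """Compute TTFT and mean TPOT for one streamed response.

    token_times_s holds the wall-clock time at which each output token arrived.
    """
    ttft = token_times_s[0] - request_sent_s
    if len(token_times_s) < 2:
        return ttft, 0.0
    tpot = (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)
    return ttft, tpot


def within_budget(budget, ttft, tpot):
    """True when both the first-token and per-token targets are met."""
    return ttft <= budget.ttft_s and tpot <= budget.tpot_s


def end_to_end_s(ttft, tpot, output_tokens):
    """Total response time implied by TTFT and TPOT for a given output length."""
    return ttft + tpot * max(output_tokens - 1, 0)


budget = LatencyBudget(ttft_s=0.5, tpot_s=0.05)
ttft, tpot = measure(10.0, [10.25, 10.28125, 10.3125, 10.34375, 10.375])
ok = within_budget(budget, ttft, tpot)
total = end_to_end_s(ttft, tpot, 5)
```

Because the check runs per request, a single slow stream fails its budget even when the average across all users looks healthy, which is the masking effect that aggregate metrics introduce.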

Bottom Line

Define inference workloads by per-replica, per-user latency budgets (TTFT and TPOT) and token distributions tailored to your application archetype, rather than optimizing for aggregate throughput alone.
