Stanford CS25: Transformers United V6 I Serving Transformers: Lessons from the Trenches

Stanford Online

Podcasts | June 04, 2026 | 6.24 thousand views | 1:22:31

TL;DR

Inference has emerged as the critical revenue-generating phase of AI, requiring engineers to treat serving as a full-stack discipline spanning applications to hardware, with precise workload definition being the foundation of profitable deployment.

💰 Inference as the Revenue Engine

Training is a cost center; inference generates revenue

Organizations cannot monetize model weights directly and must convert them into served systems, making efficient inference the economic engine that funds continued model development.

Modern training pipelines depend on massive inference

Reinforcement learning and tool-use feedback loops now require generating model outputs at scale, potentially consuming more FLOPs than the pre-training phase itself.

Full-stack engineering from apps to electrons

Inference optimization uniquely spans application design, linear algebra, GPU kernel tuning, and physical hardware constraints including cooling and power delivery.

🎯 Three Application Archetypes

Chatbot+: Strict human latency with tool integration

Interactive systems like Claude Code must respect human reaction times while using text outputs to trigger calls to external computer systems.

Background Agents: Async workflows with loose constraints

Systems like Devin or coding agents operate on minute-to-hour timelines where humans submit tasks and return later, tolerating significantly higher latency.

Data Processors: Bursty batch processing jobs

Document indexing and extraction workloads tolerate high latency but feature extremely bursty traffic patterns with long gaps between intense write spikes.

📊 Defining Workload Metrics

Quantify via QPS and non-deterministic token counts

Measure queries per second alongside the distributions of input and output token counts, recognizing that output length is determined by the model's own generation behavior rather than by user input, so it must be measured rather than assumed.
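As a rough illustration of the measurement described above, the sketch below summarizes a request log into QPS plus token-count percentiles. The `Request` record, its field names, and the choice of p50/p95/p99 are assumptions made for this example, not something specified in the talk.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """Hypothetical log record for one served request."""
    arrival_s: float    # arrival time, in seconds
    input_tokens: int   # prompt length
    output_tokens: int  # generated length (only known after the fact)


def percentiles(values):
    """Return p50/p95/p99 using inclusive linear interpolation."""
    cuts = quantiles(values, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def summarize(requests):
    """Summarize a workload as average QPS plus token distributions."""
    times = [r.arrival_s for r in requests]
    window = max(times) - min(times)
    qps = len(requests) / window if window > 0 else float("nan")
    return {
        "qps": qps,
        "input_tokens": percentiles([r.input_tokens for r in requests]),
        "output_tokens": percentiles([r.output_tokens for r in requests]),
    }


# Synthetic log: one request per second, growing prompt and output lengths.
log = [Request(float(i), 100 + 10 * i, 50 + 5 * i) for i in range(11)]
summary = summarize(log)
```

Reporting percentiles rather than a single mean keeps the long tail of output lengths visible, which matters because that tail is set by the model and cannot be capped by inspecting the input alone.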

Leverage prefix reuse for aggressive KV caching

Caching previous computations converts GPU workload into storage reads, delivering significant cost savings particularly for latency-tolerant agent and batch workloads.
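To make the prefix-reuse idea concrete, here is a minimal sketch of block-level prefix matching. It only tracks which token blocks have been seen and stores no real key/value tensors; the `PrefixCache` class, the block size, and the chained-hash keying are illustrative assumptions, not the design of any particular serving engine.

```python
import hashlib


class PrefixCache:
    """Toy prefix cache: tracks which token-block prefixes were already computed."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.blocks = set()  # stands in for stored KV blocks

    def _block_keys(self, tokens):
        """Yield one key per full block; each key depends on the whole prefix."""
        h = hashlib.sha256()
        full_blocks = len(tokens) // self.block_size
        for i in range(full_blocks):
            block = tokens[i * self.block_size:(i + 1) * self.block_size]
            h.update(repr(block).encode())
            yield h.hexdigest()

    def lookup_and_insert(self, tokens):
        """Return (reused_tokens, computed_tokens) and record new blocks."""
        reused = 0
        still_matching = True
        for key in self._block_keys(tokens):
            if still_matching and key in self.blocks:
                reused += self.block_size  # served from cache, no recompute
            else:
                still_matching = False     # first miss ends the shared prefix
                self.blocks.add(key)
        return reused, len(tokens) - reused


cache = PrefixCache(block_size=4)
system_prompt = list(range(8))                       # shared 8-token prefix
first = cache.lookup_and_insert(system_prompt + [100, 101, 102, 103])
second = cache.lookup_and_insert(system_prompt + [200, 201, 202])
```

In this toy run the second request reuses the two blocks covering the shared prefix and only computes its three new tokens. Keying each block on the hash of everything before it ensures a block is reused only when the entire preceding prefix is identical.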

Optimize per-replica latency budgets, not aggregate throughput

Engineer for Time to First Token and Time Per Output Token on a per-user, per-replica basis because aggregate metrics mask the constraints that determine hardware requirements.
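The two per-request metrics can be computed from token arrival timestamps, as in the sketch below. The function names and the budget values used in the example are hypothetical; the definitions assumed here are TTFT as the delay until the first token and TPOT as the mean gap between subsequent tokens.

```python
def latency_metrics(request_sent_s, token_times_s):
    """Return (ttft, tpot) in seconds for one request.

    ttft: time from sending the request to receiving the first token.
    tpot: mean time between consecutive output tokens after the first.
    """
    if not token_times_s:
        raise ValueError("need at least one output token timestamp")
    ttft = token_times_s[0] - request_sent_s
    n = len(token_times_s)
    if n > 1:
        tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1)
    else:
        tpot = 0.0  # a single token has no inter-token gap
    return ttft, tpot


def meets_budget(ttft, tpot, ttft_budget_s, tpot_budget_s):
    """Check one request against per-user latency budgets."""
    return ttft <= ttft_budget_s and tpot <= tpot_budget_s


# One request: first token after 0.5 s, then a token every 0.25 s.
ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
ok = meets_budget(ttft, tpot, ttft_budget_s=1.0, tpot_budget_s=0.3)
```

Evaluating each request against its own budget, instead of averaging tokens per second across all users, surfaces the individual requests that miss their targets even when total throughput looks healthy.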

Bottom Line

Define inference workloads by per-replica, per-user latency budgets (TTFT and TPOT) and token distributions tailored to your application archetype, rather than optimizing for aggregate throughput alone.
