Stanford CS153 Frontier Systems | The Discipline of Delivering Value per Gigawatt
TL;DR
Google's infrastructure lead Amin Vahdat argues that AI's critical constraint is not raw gigawatts but value delivered per unit of power, requiring a shift from five-nines reliability to throughput-optimized systems where 99.9% uptime is acceptable if it doubles usable capacity.
⚡ Value Over Capacity: The New Metric
Measure daily active users per gigawatt, not gigawatts alone
With infrastructure costing $40–50 billion per gigawatt, the critical metric is business value—such as daily active users or revenue—delivered per dollar spent, rather than raw capacity deployed.
Treat idle capacity as a major outage
Google considers sub-96% node allocation a major outage because unused capacity is wasted capital; the focus must be on 'goodput' and ensuring every accelerator delivers value rather than sitting idle waiting for data or orchestration.
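The two metrics in this section can be sketched in a few lines of Python. This is an illustrative sketch only: the `FleetSnapshot` type, its field names, and the idea of multiplying allocation by a "productive fraction" to approximate goodput are assumptions made for the example, not definitions from the talk. Only the 96% allocation threshold comes from the summary above.

```python
from dataclasses import dataclass


@dataclass
class FleetSnapshot:
    """Hypothetical point-in-time view of an accelerator fleet."""
    total_nodes: int
    allocated_nodes: int        # nodes currently assigned to some job
    productive_fraction: float  # assumed input: share of allocated time doing useful work


def allocation_rate(snapshot: FleetSnapshot) -> float:
    """Fraction of the fleet that is assigned to a job."""
    return snapshot.allocated_nodes / snapshot.total_nodes


def is_major_outage(snapshot: FleetSnapshot, threshold: float = 0.96) -> bool:
    """Flag the fleet when allocation drops below the threshold (96% per the summary)."""
    return allocation_rate(snapshot) < threshold


def effective_goodput(snapshot: FleetSnapshot) -> float:
    """Assumed goodput proxy: allocated share times the share of that time doing useful work."""
    return allocation_rate(snapshot) * snapshot.productive_fraction


def value_per_gigawatt(business_value: float, gigawatts: float) -> float:
    """Business value (for example daily active users or revenue) per gigawatt deployed."""
    return business_value / gigawatts


if __name__ == "__main__":
    # All numbers below are made up for illustration.
    snap = FleetSnapshot(total_nodes=10_000, allocated_nodes=9_500, productive_fraction=0.80)
    print(f"allocation rate: {allocation_rate(snap):.1%}")
    print(f"major outage:    {is_major_outage(snap)}")
    print(f"goodput proxy:   {effective_goodput(snap):.1%}")
    print(f"DAU per GW:      {value_per_gigawatt(50_000_000, 2.0):,.0f}")
```

The point of the sketch is that a fleet can look healthy by uptime measures while still failing the allocation test, and that allocation alone overstates useful work if allocated accelerators spend time stalled.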
⚖️ System Balance and Amdahl's Law
Flops without bandwidth are wasted money
Citing Amdahl's rule of thumb for balanced systems (commonly stated as roughly 1 Mbit/s of I/O for every 1 MIPS of compute), Vahdat stresses that modern infrastructure must balance compute with HBM bandwidth, SRAM, and network capacity to prevent starvation and achieve true utilization.
Sparse models break hardware balance assumptions
The shift to mixture-of-experts architectures demands more memory bandwidth relative to compute than current hardware provides, explaining why some clusters operate at only 11% Model FLOPs Utilization (MFU).
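One way to make the balance argument concrete is the standard roofline model, which is a different tool from Amdahl's rule but captures the same idea: attainable throughput is capped by whichever of compute or memory bandwidth runs out first. The sketch below is a minimal illustration; the accelerator numbers are hypothetical and are not taken from the talk or from any real chip, and the function names are invented for the example.

```python
def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: useful model FLOP/s divided by hardware peak FLOP/s."""
    return achieved_flops_per_s / peak_flops_per_s


def roofline_attainable(peak_flops_per_s: float,
                        mem_bw_bytes_per_s: float,
                        arithmetic_intensity: float) -> float:
    """Roofline bound: min(compute peak, memory bandwidth * FLOPs performed per byte moved)."""
    return min(peak_flops_per_s, mem_bw_bytes_per_s * arithmetic_intensity)


def ridge_point(peak_flops_per_s: float, mem_bw_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_flops_per_s / mem_bw_bytes_per_s


if __name__ == "__main__":
    # Hypothetical accelerator: 1 PFLOP/s peak compute, 3 TB/s memory bandwidth.
    peak = 1e15
    bandwidth = 3e12

    # Hypothetical workload that performs only 40 FLOPs per byte fetched,
    # as a bandwidth-hungry sparse layer might.
    intensity = 40.0

    bound = roofline_attainable(peak, bandwidth, intensity)
    print(f"ridge point:      {ridge_point(peak, bandwidth):.0f} FLOP/byte")
    print(f"attainable:       {bound:.2e} FLOP/s")
    print(f"best-case MFU:    {mfu(bound, peak):.0%}")
```

Under these assumed numbers the workload sits far below the ridge point, so most of the purchased FLOPs cannot be used no matter how well the software is tuned. Raising bandwidth, or raising the work done per byte, is the only way to lift the ceiling.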
🔄 The Reliability Paradigm Shift
Frontier labs choose capacity over five-nines uptime
Unlike enterprise services requiring 99.999% reliability (about 5 minutes of downtime per year), AI training workloads now prioritize throughput, accepting 99.9% reliability (about 8.8 hours of downtime per year) in exchange for doubling usable capacity.
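The downtime budget implied by an availability target is simple arithmetic, and it is worth checking by hand because the "nines" are easy to misremember. The helper below is a minimal sketch that assumes a 365-day year; the function name is chosen for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds, assuming a 365-day year


def downtime_seconds_per_year(availability: float) -> float:
    """Allowed downtime per year, in seconds, for an availability between 0 and 1."""
    return (1.0 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for label, availability in [("99.999%", 0.99999), ("99.9%", 0.999), ("99%", 0.99)]:
        seconds = downtime_seconds_per_year(availability)
        print(f"{label:>8}: {seconds / 60:10.2f} minutes "
              f"= {seconds / 3600:8.2f} hours = {seconds / 86400:6.3f} days")
```

Five nines allows roughly 5.26 minutes per year, three nines roughly 8.76 hours, and two nines roughly 3.65 days.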
Synchronous training breaks loose coupling
Traditional distributed systems assumed fungible nodes where any rack could fail unnoticed, but synchronous AI training makes every accelerator special—if one fails, the entire job stops, requiring new failure management approaches.
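A rough calculation shows why synchronous training changes the failure picture. The sketch below assumes accelerators fail independently and at a constant rate, which is a simplification and not a claim from the talk; the function names and all numbers are hypothetical.

```python
def job_survival_probability(per_node_survival: float, num_nodes: int) -> float:
    """Probability that every node survives an interval, assuming independent failures.

    In a synchronous job a single failure stops everything, so the job survives
    only if all nodes do.
    """
    return per_node_survival ** num_nodes


def job_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Mean time between job interruptions, assuming independent constant-rate failures."""
    return node_mtbf_hours / num_nodes


if __name__ == "__main__":
    # Hypothetical: each accelerator has a 99.99% chance of surviving a given day.
    nodes = 10_000
    print(f"chance the whole job survives the day: "
          f"{job_survival_probability(0.9999, nodes):.1%}")

    # Hypothetical: each accelerator fails on average once every 50,000 hours.
    print(f"expected hours between job interruptions: "
          f"{job_mtbf_hours(50_000, nodes):.1f}")
```

Under these assumptions, hardware that looks extremely reliable per node still interrupts a 10,000-accelerator job every few hours, which is why checkpointing, fast restart, and hot spares matter more here than per-node uptime.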
Bottom Line
Optimize AI infrastructure for value per dollar by accepting 99.9% reliability to maximize usable capacity, while ensuring strict system balance between compute, memory, and networking to prevent expensive hardware from sitting idle.