Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 3: Architectures
TL;DR
This lecture surveys the evolution of modern transformer architectures by analyzing 19+ recent dense language models. Two patterns emerge: universal adoption of pre-normalization and RMSNorm for training stability and hardware efficiency, and a historical arc from post-GPT-3 experimentation to convergence around Llama 2 and, more recently, divergence toward stability-focused designs.
🧮 Normalization Architecture Standards
Pre-norm placement is now universal
Modern transformers universally place layer normalization before the attention and FFN blocks, outside the residual stream, rather than after them as in the original Vaswani et al. transformer. This placement enables cleaner gradient propagation and eliminates the need for training warm-up.
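The placement difference can be sketched in a few lines of numpy; `sublayer` stands in for an attention or FFN block, and the simplified `rms_norm` (no learned gain) is illustrative only:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Simplified RMS normalization (no learned gain), for illustration.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sublayer):
    # Modern placement: normalize the sublayer *input*; the residual
    # stream x passes through untouched.
    return x + sublayer(rms_norm(x))

def post_norm_block(x, sublayer):
    # Original Vaswani et al. placement: the norm sits inside the
    # residual stream, after the addition.
    return rms_norm(x + sublayer(x))
```

With a sublayer that contributes nothing, `pre_norm_block` returns `x` unchanged, which is exactly the clean residual path the lecture emphasizes; `post_norm_block` rescales the stream at every layer.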
RMSNorm replaces LayerNorm for speed
All modern models use RMSNorm, which drops LayerNorm's mean subtraction and bias to minimize memory movement. This matters because normalization accounts for only about 0.17% of FLOPs yet up to 25% of execution time on small models, so cutting its memory traffic reduces runtime significantly.
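The operational difference is small but shows where the saving comes from; a numpy sketch (parameter shapes are illustrative):

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # LayerNorm: subtract the per-token mean, divide by the standard
    # deviation, then apply a learned gain and bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def rms_norm(x, gain, eps=1e-5):
    # RMSNorm: skip the mean and the bias entirely -- one fewer
    # reduction over the hidden dimension and one fewer parameter
    # vector to read per call.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms
```

For zero-mean inputs the two coincide (the variance equals the mean square), so the saving is purely in memory traffic and parameters, not in what the operation computes.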
Double normalization solves stability issues
Recent models like Grok, Gemma 2, and OLMo 2 apply layer norms both before and after computation blocks, following the empirical principle that sprinkling additional normalization throughout architectures resolves training instability.
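One plausible reading of this double ("sandwich") placement, sketched in numpy with an illustrative gain-free norm: both norms wrap the computation block, and neither touches the residual stream.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Simplified RMS normalization (no learned gain), for illustration.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def double_norm_block(x, sublayer):
    # Normalize the sublayer's input AND its output, then add the
    # result to an untouched residual stream -- extra stabilization
    # without giving up the clean pre-norm gradient path.
    return x + rms_norm(sublayer(rms_norm(x)))
```

Because the residual stream still carries `x` through unchanged, this keeps the pre-norm identity path while bounding the magnitude of what each block contributes.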
📈 Historical Design Evolution
Llama 2 created architectural convergence
After extensive experimentation through the GPT-3 era, Llama 2 established a de facto standard that unified the field before recent divergence toward stability optimizations and long-context capabilities.
Post-norm survives only as anomaly
OPT 350M represents the sole modern exception retaining residual-stream post-normalization, which correlates with its poor training stability compared to contemporary models that keep residual streams clean.
Gradient attenuation drives pre-norm adoption
Pre-norm architectures maintain consistent gradient magnitudes through deep networks by preserving straight-through propagation paths, while post-norm creates destabilizing gradient norm fluctuations that prevent stable training at scale.
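The identity-path argument can be checked numerically. The finite-difference Jacobian-vector product below is a toy illustration, and the "dead" sublayer is a deliberately extreme assumption: even when a sublayer contributes no gradient at all, the pre-norm residual still passes the incoming gradient through exactly.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Gain-free RMS normalization, for illustration.
    return x / np.sqrt(np.mean(x * x) + eps)

def jvp_fd(f, x, v, eps=1e-5):
    # Finite-difference estimate of the Jacobian-vector product J_f(x) @ v.
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

dead = lambda x: np.zeros_like(x)  # a sublayer whose gradient has died

pre  = lambda x: x + dead(rms_norm(x))   # pre-norm residual block
post = lambda x: rms_norm(x + dead(x))   # post-norm block

rng = np.random.default_rng(0)
x, v = rng.normal(size=16), rng.normal(size=16)

# Pre-norm: the straight-through path guarantees J @ v == v, so the
# gradient survives no matter how deep the stack is.
print(np.allclose(jvp_fd(pre, x, v), v))   # True
# Post-norm: the norm sits on the residual stream, so J @ v is rescaled
# and projected -- the straight-through path is gone.
print(np.allclose(jvp_fd(post, x, v), v))  # False
```

Stacking many post-norm Jacobians compounds these per-layer rescalings, which is consistent with the gradient-norm fluctuations described above.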
⚙️ Systems and Implementation Constraints
Arithmetic intensity dictates component choice
Architectural decisions prioritize GPU utilization and memory bandwidth over pure representational capacity, which explains why hardware-friendly operations like RMSNorm and SwiGLU dominate despite their simplicity.
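A back-of-envelope arithmetic-intensity estimate makes this concrete; the `d_model` value and the FLOP and byte counts below are illustrative assumptions, not figures from the lecture:

```python
# Rough arithmetic intensity of RMSNorm over one token, assuming bf16
# activations (2 bytes/element) and d_model = 4096 (illustrative).
d_model = 4096

# FLOPs: ~2*d for the mean of squares (multiply + add), ~2*d for the
# divide and the learned-gain multiply; the single sqrt is negligible.
flops = 4 * d_model

# Memory traffic: read x, read the gain vector, write the output.
bytes_moved = 3 * d_model * 2

intensity = flops / bytes_moved
print(round(intensity, 2))  # 0.67
```

At well under one FLOP per byte, the operation is memory-bound regardless of how fast the GPU's compute units are, so the win comes from moving fewer bytes, which is exactly what dropping the mean and bias buys.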
Stability requirements bake into architecture
Modern transformer designs embed training-stability mechanisms directly into their structure rather than relying solely on optimization hyperparameters, as evidenced by the field-wide shifts from ReLU to SwiGLU and from sinusoidal position embeddings to RoPE.
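For reference, the SwiGLU feed-forward unit mentioned above can be sketched as follows; the weight names and shapes are illustrative, not the lecture's notation:

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. Swish): x * sigmoid(x), a smooth replacement for ReLU.
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU FFN: a gated linear unit whose gate passes through SiLU,
    # replacing the plain ReLU MLP of early transformers.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

The gate gives the block a smooth, input-dependent on/off behavior instead of ReLU's hard cutoff, which is the kind of structural choice the insight above describes.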
Bottom Line
Design transformers with pre-normalization and RMSNorm as non-negotiable foundations for stability and efficiency, and treat additional normalization layers as the primary tool for resolving training convergence issues.