Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 3: Architectures

| Podcasts | April 15, 2026 | 4.36K views | 1:29:14

TL;DR

This lecture surveys modern transformer architecture evolution by analyzing 19+ recent dense language models, revealing universal adoption of pre-normalization and RMSNorm for training stability and hardware efficiency, while tracing the field's shift from post-GPT-3 experimentation to Llama 2 convergence and recent divergence toward stability-focused designs.

🧮 Normalization Architecture Standards

Pre-norm placement is now universal

Modern transformers universally apply layer normalization to the input of each attention and FFN block rather than on the residual stream after the addition, as the original Vaswani paper did. This placement yields cleaner gradient propagation and removes the need for a learning-rate warm-up.
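The two placements can be sketched in plain Python (a hypothetical `sublayer` callable stands in for attention or the FFN; learned gains and biases are omitted for brevity):

```python
import math

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm (gains and biases omitted): subtract the mean,
    # divide by the standard deviation.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    # Original Vaswani placement: the norm sits on the residual stream,
    # applied after the residual addition.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # Modern placement: only the branch input is normalized; the residual
    # stream passes through as an identity path.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
```

With a branch that outputs zeros, the pre-norm block returns its input unchanged while the post-norm block still rescales it; this is the "clean residual stream" property the lecture emphasizes.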

RMSNorm replaces LayerNorm for speed

All modern models use RMSNorm, which drops LayerNorm's mean subtraction and bias term. Normalization accounts for only about 0.17% of FLOPs yet up to 25% of execution time on small models, because its cost is dominated by memory movement rather than arithmetic; RMSNorm's reduced memory traffic therefore cuts runtime significantly.
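A side-by-side sketch in plain Python (per-vector, with the learned `gain` and `bias` passed explicitly; the eps value is illustrative):

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    # LayerNorm: two reduction passes (mean, then variance) plus a bias,
    # so more parameters to load and more memory traffic.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-5):
    # RMSNorm: no mean subtraction and no bias; one reduction pass and
    # fewer parameters, which is where the speedup comes from.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]
```

For zero-mean inputs the two coincide (with unit gain and zero bias), which is part of why dropping the mean subtraction costs so little in practice.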

Double normalization solves stability issues

Recent models like Grok, Gemma 2, and OLMo 2 apply layer norms both before and after computation blocks, following the empirical principle that sprinkling additional normalization throughout architectures resolves training instability.
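A minimal sketch of this "sandwich" placement, assuming an RMS-style norm and a generic `sublayer` callable for attention or the FFN:

```python
import math

def rms_norm(x, eps=1e-6):
    # RMS normalization without learned gain, for brevity.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def double_norm_block(x, sublayer):
    # Normalize both the input and the output of the computation block;
    # the residual stream itself still bypasses all normalization.
    return [a + b for a, b in zip(x, rms_norm(sublayer(rms_norm(x))))]
```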

📈 Historical Design Evolution

Llama 2 created architectural convergence

After extensive experimentation through the GPT-3 era, Llama 2 established a de facto standard that unified the field before recent divergence toward stability optimizations and long-context capabilities.

Post-norm survives only as an anomaly

OPT 350M represents the sole modern exception retaining residual-stream post-normalization, which correlates with its poor training stability compared to contemporary models that keep residual streams clean.

Gradient attenuation drives pre-norm adoption

Pre-norm architectures maintain consistent gradient magnitudes through deep networks because the residual stream provides an unobstructed identity path for gradients, while post-norm places normalization on that path, causing gradient-norm fluctuations that prevent stable training at scale.

⚙️ Systems and Implementation Constraints

Arithmetic intensity dictates component choice

Architectural decisions prioritize GPU utilization and memory bandwidth over pure representational capacity, which explains why simple, hardware-friendly operations like RMSNorm and SwiGLU dominate: their speed on real accelerators outweighs any marginal loss in expressiveness.
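A back-of-envelope illustration of why a norm is bandwidth-bound (the per-element FLOP count and byte sizes are rough assumptions, not measured numbers):

```python
def arithmetic_intensity(d, bytes_per_elem=2, flops_per_elem=3):
    # Crude cost model for a d-dim RMSNorm: a few FLOPs per element
    # (square, accumulate, scale) against one read and one write of the
    # activation vector in 16-bit precision.
    flops = flops_per_elem * d
    bytes_moved = 2 * d * bytes_per_elem  # read + write
    return flops / bytes_moved

# Modern accelerators sustain far more FLOPs per byte of memory bandwidth
# than this ratio, so the op is bandwidth-bound: speed comes from moving
# fewer bytes, not from doing fewer FLOPs.
```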

Stability requirements bake into architecture

Modern transformer designs embed training stability and efficiency mechanisms directly into their structure rather than relying solely on optimization hyperparameters, as evidenced by the near-universal shifts from ReLU to SwiGLU activations and from sinusoidal position embeddings to RoPE.
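A SwiGLU feed-forward sketch in plain Python (weight matrices as nested lists; biases omitted, as is typical in these models; shapes are illustrative):

```python
import math

def silu(v):
    # SiLU / swish activation: v * sigmoid(v).
    return v / (1.0 + math.exp(-v))

def swiglu(x, W, V, W2):
    # Gated FFN: the elementwise product of a SiLU-gated up-projection
    # and a linear up-projection, followed by a down-projection.
    # Shapes: W, V are d_ff x d_model; W2 is d_model x d_ff.
    def matvec(M, v):
        return [sum(m * a for m, a in zip(row, v)) for row in M]
    gate = [silu(v) for v in matvec(W, x)]
    up = matvec(V, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W2, hidden)
```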

Bottom Line

Design transformers with pre-normalization and RMSNorm as non-negotiable foundations for stability and efficiency, and treat additional normalization layers as the primary tool for resolving training convergence issues.
