Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 3: Architectures
TL;DR
This lecture surveys the evolution of modern transformer architectures by analyzing 19+ recent dense language models. Two patterns emerge: universal adoption of pre-normalization and RMSNorm for training stability and hardware efficiency, and a historical arc from post-GPT-3 experimentation to convergence around Llama 2 and, more recently, divergence toward stability-focused designs.
🧮 Normalization Architecture Standards
Pre-norm placement is now universal
Modern transformers universally place layer normalization before the attention and FFN blocks, outside the residual stream, rather than after them as in the original Vaswani et al. transformer. This placement enables cleaner gradient propagation and eliminates the need for training warm-up.
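The placement difference can be sketched in a few lines of numpy; `sublayer` stands in for an attention or FFN block, and the simplified `rms_norm` (no learned gain) is illustrative only:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Simplified RMS normalization (no learned gain), for illustration.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sublayer):
    # Modern placement: normalize the sublayer *input*; the residual
    # stream x passes through untouched.
    return x + sublayer(rms_norm(x))

def post_norm_block(x, sublayer):
    # Original Vaswani et al. placement: the norm sits inside the
    # residual stream, after the addition.
    return rms_norm(x + sublayer(x))
```

With a sublayer that contributes nothing, `pre_norm_block` returns `x` unchanged, which is exactly the clean residual path the lecture emphasizes; `post_norm_block` rescales the stream at every layer.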
RMSNorm replaces LayerNorm for speed
All modern models use RMSNorm, which drops LayerNorm's mean subtraction and bias to minimize memory movement. This matters because normalization accounts for only about 0.17% of FLOPs yet up to 25% of execution time on small models, so cutting its memory traffic reduces runtime significantly.
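The operational difference is small but shows where the saving comes from; a numpy sketch (parameter shapes are illustrative):

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # LayerNorm: subtract the per-token mean, divide by the standard
    # deviation, then apply a learned gain and bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def rms_norm(x, gain, eps=1e-5):
    # RMSNorm: skip the mean and the bias entirely -- one fewer
    # reduction over the hidden dimension and one fewer parameter
    # vector to read per call.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms
```

For zero-mean inputs the two coincide (the variance equals the mean square), so the saving is purely in memory traffic and parameters, not in what the operation computes.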
Double normalization solves stability issues
Recent models like Grok, Gemma 2, and OLMo 2 apply layer norms both before and after computation blocks, following the empirical principle that sprinkling additional normalization throughout architectures resolves training instability.
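One plausible reading of this double ("sandwich") placement, sketched in numpy with an illustrative gain-free norm: both norms wrap the computation block, and neither touches the residual stream.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Simplified RMS normalization (no learned gain), for illustration.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def double_norm_block(x, sublayer):
    # Normalize the sublayer's input AND its output, then add the
    # result to an untouched residual stream -- extra stabilization
    # without giving up the clean pre-norm gradient path.
    return x + rms_norm(sublayer(rms_norm(x)))
```

Because the residual stream still carries `x` through unchanged, this keeps the pre-norm identity path while bounding the magnitude of what each block contributes.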
📈 Historical Design Evolution
Llama 2 created architectural convergence
After extensive experimentation through the GPT-3 era, Llama 2 established a de facto standard that unified the field before recent divergence toward stability optimizations and long-context capabilities.
Post-norm survives only as anomaly
OPT 350M represents the sole modern exception retaining residual-stream post-normalization, which correlates with its poor training stability compared to contemporary models that keep residual streams clean.
Gradient attenuation drives pre-norm adoption
Pre-norm architectures maintain consistent gradient magnitudes through deep networks by preserving straight-through propagation paths, while post-norm creates destabilizing gradient norm fluctuations that prevent stable training at scale.
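The identity-path argument can be checked numerically. The finite-difference Jacobian-vector product below is a toy illustration, and the "dead" sublayer is a deliberately extreme assumption: even when a sublayer contributes no gradient at all, the pre-norm residual still passes the incoming gradient through exactly.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Gain-free RMS normalization, for illustration.
    return x / np.sqrt(np.mean(x * x) + eps)

def jvp_fd(f, x, v, eps=1e-5):
    # Finite-difference estimate of the Jacobian-vector product J_f(x) @ v.
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

dead = lambda x: np.zeros_like(x)  # a sublayer whose gradient has died

pre  = lambda x: x + dead(rms_norm(x))   # pre-norm residual block
post = lambda x: rms_norm(x + dead(x))   # post-norm block

rng = np.random.default_rng(0)
x, v = rng.normal(size=16), rng.normal(size=16)

# Pre-norm: the straight-through path guarantees J @ v == v, so the
# gradient survives no matter how deep the stack is.
print(np.allclose(jvp_fd(pre, x, v), v))   # True
# Post-norm: the norm sits on the residual stream, so J @ v is rescaled
# and projected -- the straight-through path is gone.
print(np.allclose(jvp_fd(post, x, v), v))  # False
```

Stacking many post-norm Jacobians compounds these per-layer rescalings, which is consistent with the gradient-norm fluctuations described above.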
⚙️ Systems and Implementation Constraints
Arithmetic intensity dictates component choice
Architectural decisions prioritize GPU utilization and memory bandwidth over pure representational capacity, which explains why hardware-friendly operations like RMSNorm and SwiGLU dominate despite their simplicity.
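A back-of-envelope arithmetic-intensity estimate makes this concrete; the `d_model` value and the FLOP and byte counts below are illustrative assumptions, not figures from the lecture:

```python
# Rough arithmetic intensity of RMSNorm over one token, assuming bf16
# activations (2 bytes/element) and d_model = 4096 (illustrative).
d_model = 4096

# FLOPs: ~2*d for the mean of squares (multiply + add), ~2*d for the
# divide and the learned-gain multiply; the single sqrt is negligible.
flops = 4 * d_model

# Memory traffic: read x, read the gain vector, write the output.
bytes_moved = 3 * d_model * 2

intensity = flops / bytes_moved
print(round(intensity, 2))  # 0.67
```

At well under one FLOP per byte, the operation is memory-bound regardless of how fast the GPU's compute units are, so the win comes from moving fewer bytes, which is exactly what dropping the mean and bias buys.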
Stability requirements bake into architecture
Modern transformer designs embed training-stability mechanisms directly into their structure rather than relying solely on optimization hyperparameters, as evidenced by the field-wide shifts from ReLU to SwiGLU and from sinusoidal position embeddings to RoPE.
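For reference, the SwiGLU feed-forward unit mentioned above can be sketched as follows; the weight names and shapes are illustrative, not the lecture's notation:

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. Swish): x * sigmoid(x), a smooth replacement for ReLU.
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU FFN: a gated linear unit whose gate passes through SiLU,
    # replacing the plain ReLU MLP of early transformers.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

The gate gives the block a smooth, input-dependent on/off behavior instead of ReLU's hard cutoff, which is the kind of structural choice the insight above describes.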
Bottom Line
Design transformers with pre-normalization and RMSNorm as non-negotiable foundations for stability and efficiency, and treat additional normalization layers as the primary tool for resolving training convergence issues.