Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 3: Architectures
TL;DR
This lecture surveys how transformer architectures have evolved by comparing 19+ recent dense language models. It finds near-universal adoption of pre-normalization and RMSNorm, motivated by training stability and hardware efficiency, and traces the field's path from post-GPT-3 experimentation, through convergence on Llama 2-style designs, to a recent divergence toward stability-focused designs.
🧮 Normalization Architecture Standards
Pre-norm placement is now universal
Modern transformers almost universally apply normalization to the input of each attention and FFN block, off the residual stream, rather than after the residual addition as in the original Vaswani et al. design. This gives gradients a cleaner path through the network and reduces reliance on learning-rate warm-up.
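The two orderings can be sketched in a few lines. This is a minimal illustration, not any particular model's code: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the norm has no learned parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over a 1-D list (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the FFN (bounded elementwise map)."""
    return [0.5 * math.tanh(v) for v in x]

def post_norm_block(x):
    # Original ordering: the norm is applied to the residual stream itself.
    summed = [a + b for a, b in zip(x, toy_sublayer(x))]
    return layer_norm(summed)

def pre_norm_block(x):
    # Modern ordering: the norm feeds only the sublayer; x passes through unchanged.
    update = toy_sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

x = [1.0, -2.0, 3.0, 0.5]
pre_out = pre_norm_block(x)
post_out = post_norm_block(x)
```

In the pre-norm version the output is the input plus a bounded update, so the residual stream is never rescaled. In the post-norm version every block renormalizes the stream.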
RMSNorm replaces LayerNorm for speed
Nearly all modern models use RMSNorm, which drops LayerNorm's mean subtraction and bias term, to cut memory movement. Normalization accounts for only about 0.17% of FLOPs but up to 25% of execution time on small models, so the simpler operation gives a meaningful runtime reduction.
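A minimal sketch of the two normalizations side by side, assuming 1-D inputs and a per-feature `gain` (and `bias` for LayerNorm); the function names are illustrative and not taken from any library.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """LayerNorm: subtract the mean, divide by the std, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-5):
    """RMSNorm: divide by the root mean square, then scale. No mean, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]

x = [1.0, -1.0, 2.0, -2.0]          # zero-mean input
ones, zeros = [1.0] * 4, [0.0] * 4
ln_out = layer_norm(x, ones, zeros)
rms_out = rms_norm(x, ones)
```

On a zero-mean input with unit gain and zero bias the two coincide. RMSNorm needs one fewer reduction and one fewer parameter vector.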
Double normalization solves stability issues
Recent models such as Grok and Gemma 2 apply norms both before and after each computation block, while OLMo 2 applies a norm only after the block, still off the residual stream. The empirical principle is that adding normalization throughout an architecture tends to resolve training instability.
📈 Historical Design Evolution
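A rough sketch of a block normalized on both sides, assuming both norms sit off the residual stream and carry no learned gain. `toy_sublayer` is a hypothetical placeholder whose output scale is deliberately large.

```python
import math

def rms_norm(x, eps=1e-6):
    """Parameter-free RMSNorm over a 1-D list."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical sublayer with a large, poorly controlled output scale."""
    return [50.0 * v + 1.0 for v in x]

def double_norm_block(x):
    h = rms_norm(x)          # pre-norm: controls the scale going into the sublayer
    h = toy_sublayer(h)
    h = rms_norm(h)          # second norm: controls the scale of the update
    return [a + b for a, b in zip(x, h)]   # residual stream itself is not normalized

x = [0.1, -0.2, 0.3, 0.4]
out = double_norm_block(x)
update = [o - a for o, a in zip(out, x)]
update_rms = math.sqrt(sum(u * u for u in update) / len(update))
```

Whatever scale the sublayer produces, the update added to the residual stream has an RMS close to 1. This is one plausible reading of why the extra norm helps stability.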
Llama 2 created architectural convergence
After extensive experimentation through the GPT-3 era, Llama 2 established a de facto standard that unified the field before recent divergence toward stability optimizations and long-context capabilities.
Post-norm survives only as anomaly
OPT 350M represents the sole modern exception retaining residual-stream post-normalization, which correlates with its poor training stability compared to contemporary models that keep residual streams clean.
Gradient attenuation drives pre-norm adoption
Pre-norm architectures maintain consistent gradient magnitudes through deep networks by preserving straight-through propagation paths, while post-norm creates destabilizing gradient norm fluctuations that prevent stable training at scale.
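The argument can be made concrete with a one-layer Jacobian, written here as a sketch. The symbols are introduced for illustration: $x_l$ is the residual stream at layer $l$, $F_l$ is the sublayer (attention or FFN), and $J_{\mathrm{Norm}}$ is the Jacobian of the normalization.

```latex
\begin{aligned}
\text{Pre-norm:}\quad x_{l+1} &= x_l + F_l\big(\mathrm{Norm}(x_l)\big),
&\quad \frac{\partial x_{l+1}}{\partial x_l} &= I + \frac{\partial F_l\big(\mathrm{Norm}(x_l)\big)}{\partial x_l} \\[4pt]
\text{Post-norm:}\quad x_{l+1} &= \mathrm{Norm}\big(x_l + F_l(x_l)\big),
&\quad \frac{\partial x_{l+1}}{\partial x_l} &= J_{\mathrm{Norm}}\big(x_l + F_l(x_l)\big)\left(I + \frac{\partial F_l(x_l)}{\partial x_l}\right)
\end{aligned}
```

In the pre-norm case each layer contributes an identity term, so the product over layers always contains a direct path from output to input. In the post-norm case every layer multiplies the gradient by $J_{\mathrm{Norm}}$, so the gradient scale can drift with depth.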
⚙️ Systems and Implementation Constraints
Arithmetic intensity dictates component choice
Architectural decisions weigh GPU utilization and memory bandwidth alongside representational capacity. This helps explain why RMSNorm, which is simpler than LayerNorm and moves less memory, has displaced it, and why components such as SwiGLU are judged by their measured cost on hardware as well as by their quality.
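A back-of-the-envelope sketch of arithmetic intensity (FLOPs per byte moved) shows why a normalization can cost far more time than its FLOP share suggests. It assumes an idealized model where each operand is read or written exactly once, 2-byte elements (bf16), and roughly 4 FLOPs per element for RMSNorm. All of these are simplifying assumptions, not measurements.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, each matrix moved once."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def rmsnorm_intensity(rows, d, bytes_per_elem=2, flops_per_elem=4):
    """FLOPs per byte for RMSNorm over a (rows x d) activation.

    flops_per_elem=4 is a rough assumption (square, accumulate, scale, gain).
    """
    flops = flops_per_elem * rows * d
    bytes_moved = bytes_per_elem * (2 * rows * d + d)  # read x, write y, read gain
    return flops / bytes_moved

mm = matmul_intensity(4096, 4096, 4096)
rn = rmsnorm_intensity(4096, 4096)
ratio = mm / rn
```

Under these assumptions the matmul does over a thousand FLOPs per byte while the norm does about one, so the norm is limited by memory bandwidth rather than compute.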
Stability requirements bake into architecture
Modern transformer designs embed training stability mechanisms directly into their structure rather than relying solely on optimization hyperparameters. This sits alongside other now-standard component changes, such as the shifts from ReLU to SwiGLU and from sinusoidal embeddings to RoPE.
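For reference, the ReLU-to-SwiGLU change can be sketched as follows. This is a minimal pure-Python illustration without bias terms; the weight names (`W_gate`, `W_up`, `W_down`) are hypothetical labels, not names from any specific codebase.

```python
import math

def silu(v):
    """SiLU / Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu_ffn(x, W1, W2):
    # Classic FFN: one up-projection, ReLU, one down-projection.
    h = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, h)

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gated FFN: a SiLU-activated gate multiplies a second linear projection.
    gate = [silu(v) for v in matvec(W_gate, x)]
    up = matvec(W_up, x)
    h = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, h)

identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]
swiglu_out = swiglu_ffn(x, identity, identity, [[1.0, 1.0]])
relu_out = relu_ffn(x, identity, [[1.0, 1.0]])
```

The gated variant uses three weight matrices instead of two, which is why models adopting it typically shrink the hidden width to keep the parameter count comparable.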
Bottom Line
Design transformers with pre-normalization and RMSNorm as non-negotiable foundations for stability and efficiency, and treat additional normalization layers as the primary tool for resolving training convergence issues.