Stanford CS25: Transformers United V6 | On the Tradeoffs of State Space Models and Transformers
TL;DR
Albert Gu analyzes the fundamental tradeoffs between State Space Models (SSMs) and Transformers, framing SSMs as "brain-like" models that compress context into a fixed-size state and achieve linear-time inference, while Transformers act as "database-like" systems whose KV cache grows with context, paying quadratic compute in exchange for precise retrieval.
🔄 The Rise of Linear Architectures
Explosion of sub-quadratic alternatives
Since Mamba's release two years ago, the field has rapidly adopted linear-complexity architectures including Mamba 2/3, xLSTM, DeltaNet, and gated DeltaNet as production-viable alternatives to transformers.
Production-scale hybrid adoption
Major AI labs now deploy hybrid models (Jamba, Zamba, Samba, Qwen, Hunyuan, NeMo-Megatron) combining SSM layers with attention mechanisms, with several models scaled to hundreds of billions of parameters.
Convergent nomenclature
Terms like linear attention, modern RNNs, linear RNNs, and state space models now largely refer to the same family of input-dependent recurrent architectures with similar computational characteristics.
🧠 State Compression vs. Database Caching
The KV cache bottleneck
Transformers function like expandable databases, maintaining a growing KV cache of every past token that enables precise pairwise comparisons, but the cache grows linearly with context length and total attention compute scales quadratically with sequence length during inference.
Fixed-state compression paradigm
SSMs operate like brains, compressing all historical context into a hidden state whose size stays fixed regardless of sequence length, enabling linear time complexity and constant memory per generation step.
Architectural tradeoff fundamentals
The distinction between these approaches centers on what they store between generation steps: transformers cache raw tokens for exact lookup while SSMs maintain compressed summaries for efficient processing.
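To make the database-versus-brain contrast concrete, here is a minimal NumPy sketch of one decode step under each paradigm (illustrative only, not code from the lecture): the attention step appends the new token to a growing KV cache and attends over everything cached, while the SSM step folds the token into a fixed-size state. Projections, multiple heads, and normalization are omitted, and the toy A, B, C parameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 16, 4  # toy sizes for illustration

def attention_decode_step(kv_cache, q, k, v):
    """Transformer-style decode: append the new token's (k, v) to the cache,
    then attend over every cached position. The cache grows by one entry per
    token, so per-step cost and memory grow with context length."""
    kv_cache["K"].append(k)
    kv_cache["V"].append(v)
    K = np.stack(kv_cache["K"])            # (t, d_model) -- grows each step
    V = np.stack(kv_cache["V"])            # (t, d_model)
    scores = K @ q / np.sqrt(d_model)      # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # exact weighted lookup over history

def ssm_decode_step(h, x, A, B, C):
    """SSM-style decode: fold the new token into a fixed-size state h and
    read the output from it. Memory and per-step cost are constant in t."""
    h = A * h + np.outer(x, B)             # (d_model, d_state), size never grows
    y = h @ C                              # (d_model,)
    return h, y

# Drive both for a few tokens to compare what is stored between steps.
kv_cache = {"K": [], "V": []}
h = np.zeros((d_model, d_state))
A = np.full(d_state, 0.9)                  # toy decay; real SSMs learn/structure this
B = rng.standard_normal(d_state) * 0.1
C = rng.standard_normal(d_state) * 0.1

for t in range(5):
    x = rng.standard_normal(d_model)
    y_attn = attention_decode_step(kv_cache, q=x, k=x, v=x)  # no projections in this toy
    h, y_ssm = ssm_decode_step(h, x, A, B, C)

print("KV cache entries after 5 tokens:", len(kv_cache["K"]))  # grows with t
print("SSM state shape after 5 tokens:", h.shape)              # fixed (16, 4)
```

The point of the sketch is what persists between steps: the cache holds one entry per past token, while the SSM state stays at d_model × d_state numbers no matter how long the sequence gets.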
⚙️ Three Critical Ingredients for SSMs
Expanded state dimensions
Modern SSMs expand the recurrent state to roughly 64-128 times the channel dimension (state-expansion factor of 64-128), a far larger memory than an LSTM's hidden state, widening the information bottleneck enough to preserve critical information from dense modalities like language.
Input-dependent selectivity
Parameters become functions of the input itself (the effective A and B matrices vary per token), allowing the model to dynamically decide what to remember or discard based on the current context (sketched in the example after this group).
Parallel training algorithms
Efficient computation via associative scans (original Mamba) and chunked matrix multiplications (Mamba 2/DeltaNet) makes training these large-state models feasible despite their recurrent formulation.
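The sketch below ties the three ingredients together under a simplified, Mamba-flavored parameterization; the projection names, shapes, and scalar per-token step size are illustrative assumptions rather than any library's exact formulation. The state is expanded to d_model × N entries, the transition and input parameters are recomputed from each token (selectivity), and the Python loop marks exactly the recurrence that production implementations replace with a parallel associative scan or chunked matrix multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_model, N = 8, 16, 64   # N is the state-expansion factor (64-128 in modern SSMs)

# Toy projections that make the recurrence parameters functions of the input
# (the "selectivity" ingredient). Shapes are illustrative assumptions.
W_delta = rng.standard_normal((d_model,)) * 0.1
W_B = rng.standard_normal((d_model, N)) * 0.1
W_C = rng.standard_normal((d_model, N)) * 0.1
A = -np.exp(rng.standard_normal(N) * 0.1)        # fixed negative "decay rates"

def selective_scan(x):
    """Sequential reference of a selective SSM over a sequence x of shape (L, d_model).

    Each channel carries an N-dimensional state (the expanded state), and the
    transition/input matrices vary per token because they are computed from x_t.
    Training-time implementations replace this Python loop with a parallel
    associative scan (original Mamba) or chunked matrix multiplications
    (Mamba 2, DeltaNet)."""
    h = np.zeros((d_model, N))                    # fixed-size state: d_model x N numbers
    ys = []
    for t in range(x.shape[0]):
        x_t = x[t]
        delta_t = np.log1p(np.exp(x_t @ W_delta))   # softplus: per-token step size > 0
        A_t = np.exp(delta_t * A)                   # input-dependent decay, shape (N,)
        B_t = x_t @ W_B                             # input-dependent input matrix, (N,)
        C_t = x_t @ W_C                             # input-dependent readout, (N,)
        h = A_t * h + np.outer(x_t, delta_t * B_t)  # remember or forget based on x_t
        ys.append(h @ C_t)                          # (d_model,)
    return np.stack(ys)                             # (L, d_model)

y = selective_scan(rng.standard_normal((L, d_model)))
print(y.shape)   # (8, 16)
```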
🏆 Current Landscape and Recommendations
Leading production variants
Mamba 2 and gated DeltaNet currently represent the most tried-and-true implementations, with gated DeltaNet offering greater modeling power at slightly reduced computational speed compared to Mamba 2.
Architectural convergence
Modern SSM variants share more structural similarities with each other than with attention mechanisms, differing primarily in specific parameterizations while maintaining the core linear-recurrent paradigm.
Framework for model selection
Choose Transformers for tasks requiring exact retrieval from long contexts and SSMs for efficient inference with compressed representations, with hybrid architectures offering practical middle-ground solutions.
Bottom Line
Select State Space Models for linear-time inference and a fixed memory footprint when tasks tolerate compressed context representations; retain Transformers when precise retrieval of arbitrary past tokens is critical; hybrid models are emerging as a practical production middle ground.