Titans: Learning to Memorize at Test Time (Paper Analysis)

| AI & Machine Learning | December 14, 2025 | 22K views | 32:31

TL;DR

This analysis of Google's Titans paper explores an architecture that extends context windows by using a 2-layer MLP as a neural memory module, one that learns to compress and retrieve long-range information at test time. The reviewer notes that the paper reinvents some existing linear-attention concepts, while still offering genuine innovation in adaptive memory.

📏 The Long Context Challenge (2 insights)

Transformer context limitations

Standard transformers can only attend within fixed context windows, making them unable to process very long sequences like videos or extended documents without losing information from earlier segments.

Previous compression methods

Earlier approaches such as Transformer-XL passed compressed hidden states between chunks to act as memory, while linear transformers used kernel tricks to accumulate keys and values into a matrix-valued state that can be updated and queried in constant time per token.
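The linear-attention accumulation described above can be sketched as follows. This is an illustrative reconstruction, not code from the paper: the dimensions, the `elu(x) + 1` feature map, and the running normalizer are common choices in the linear-transformer literature and are assumptions here.

```python
import numpy as np

def feature_map(x):
    # A positive feature map often used in linear transformers: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

d = 4
rng = np.random.default_rng(0)
S = np.zeros((d, d))   # matrix-valued state accumulating v * phi(k)^T
z = np.zeros(d)        # running normalizer: sum of phi(k) over the stream

for _ in range(10):    # stream of tokens; state size stays constant
    k, v = rng.normal(size=(2, d))
    phi_k = feature_map(k)
    S += np.outer(v, phi_k)   # constant-size update, no growing KV cache
    z += phi_k

# Read-out for a new query: all past keys/values are compressed into S.
q = rng.normal(size=d)
out = S @ feature_map(q) / (z @ feature_map(q))
print(out.shape)  # (4,)
```

The point of the sketch is the compression the reviewer highlights: every past (key, value) pair is folded into the fixed-size matrix `S`, which is exactly the "matrix-valued state" that Titans later replaces with an MLP.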

🧠 Neural Networks as Memory (3 insights)

MLP replaces matrix memory

Titans replaces the matrix-valued memory of linear transformers with a 2-layer MLP that functions as a neural network memory module queried for distant past information.

Test-time learning mechanism

The memory "learns at test time" by updating its parameters during inference to compress and store information from tokens as they exit the local context window, creating an inner learning loop.
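This inner learning loop can be sketched as a small numpy experiment. It is a minimal illustration, not the paper's exact algorithm: the sizes, the squared-error loss on a key/value pair, the ReLU activations, and the learning rate are all assumptions chosen to show the mechanism of a 2-layer MLP whose weights are updated by gradient descent during inference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16
W1 = rng.normal(scale=0.1, size=(h, d))  # memory parameters: the "state"
W2 = rng.normal(scale=0.1, size=(d, h))

def memory_read(k):
    # Query the memory: a plain 2-layer ReLU MLP.
    return W2 @ np.maximum(W1 @ k, 0.0)

def memory_write(k, v, lr=0.05):
    """One inner-loop SGD step on ||M(k) - v||^2 (manual backprop)."""
    global W1, W2
    a = W1 @ k
    hidden = np.maximum(a, 0.0)
    err = W2 @ hidden - v                 # prediction error ("surprise")
    dW2 = 2.0 * np.outer(err, hidden)
    dhidden = (W2.T @ (2.0 * err)) * (a > 0)
    dW1 = np.outer(dhidden, k)
    W1 -= lr * dW1
    W2 -= lr * dW2
    return float(err @ err)

# As a token exits the local window, its (key, value) pair is written
# into the memory by repeated gradient steps; the loss shrinks as the
# memory absorbs the association.
k, v = rng.normal(size=(2, d))
losses = [memory_write(k, v) for _ in range(50)]
print(losses[0], losses[-1])
```

Afterwards `memory_read(k)` approximates `v`, which is the sense in which the MLP "memorizes" evicted tokens; the actual paper additionally uses momentum and a forgetting term in this update.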

Dual attention architecture

The model combines standard local attention with queries to the neural memory, allowing it to retrieve relevant information from arbitrarily long sequences beyond the immediate context window.
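A hypothetical sketch of combining the two information paths described above: a softmax attention read over the local window, plus a read from the long-term memory, merged here with a scalar gate. The gate value and the stand-in memory output are illustrative assumptions (the paper explores several composition variants, including memory-as-context).

```python
import numpy as np

rng = np.random.default_rng(0)
d, window = 4, 6
Q = rng.normal(size=d)             # current token's query
K = rng.normal(size=(window, d))   # keys of the local window
V = rng.normal(size=(window, d))   # values of the local window

# Standard scaled-dot-product attention over the local window only.
scores = K @ Q / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
local = weights @ V

# Stand-in for a read from the neural long-term memory (see above);
# in the real model this comes from querying the memory MLP.
memory_out = rng.normal(size=d)

gate = 0.7  # stand-in for a learned gate in [0, 1]
out = gate * local + (1.0 - gate) * memory_out
print(out.shape)  # (4,)
```

The design point is that the local path stays exact (full softmax attention) while anything beyond the window is only reachable through the compressed memory read.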

⚖️ Critical Assessment (2 insights)

Novelty versus marketing

The reviewer argues the paper presents a 50/50 split of genuine innovation and marketing, repackaging existing concepts like RNN hidden states and linear attention accumulation with new "memory" terminology.

Matrix compression debate

The paper's claim that matrix-valued states inherently limit performance is disputed; the limitation stems from poor kernel approximations rather than the compression itself.

Bottom Line

The key innovation is using a neural network with test-time parameter updates as a memory module, offering a flexible alternative to fixed matrix states for handling arbitrarily long contexts.
