Titans: Learning to Memorize at Test Time (Paper Analysis)

| AI & Machine Learning | December 14, 2025 | 22K views | 32:31

TL;DR

This analysis of Google's Titans paper explores an architecture that extends context windows by using a 2-layer MLP as a neural memory module, one that learns to compress and retrieve long-range information at test time. The reviewer notes that the paper reinvents some existing linear-attention concepts, while still offering genuine innovation in adaptive memory.

📏 The Long Context Challenge (2 insights)

Transformer context limitations

Standard transformers can only attend within fixed context windows, making them unable to process very long sequences like videos or extended documents without losing information from earlier segments.

Previous compression methods

Earlier approaches such as Transformer-XL passed compressed hidden states between chunks to act as memory, while linear transformers used kernel tricks to accumulate keys and values into a matrix-valued state that can be updated and queried in constant time per token.
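The linear-attention accumulation described above can be sketched as follows. This is an illustrative reconstruction, not code from the paper: the dimensions, the `elu(x) + 1` feature map, and the running normalizer are common choices in the linear-transformer literature and are assumptions here.

```python
import numpy as np

def feature_map(x):
    # A positive feature map often used in linear transformers: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

d = 4
rng = np.random.default_rng(0)
S = np.zeros((d, d))   # matrix-valued state accumulating v * phi(k)^T
z = np.zeros(d)        # running normalizer: sum of phi(k) over the stream

for _ in range(10):    # stream of tokens; state size stays constant
    k, v = rng.normal(size=(2, d))
    phi_k = feature_map(k)
    S += np.outer(v, phi_k)   # constant-size update, no growing KV cache
    z += phi_k

# Read-out for a new query: all past keys/values are compressed into S.
q = rng.normal(size=d)
out = S @ feature_map(q) / (z @ feature_map(q))
print(out.shape)  # (4,)
```

The point of the sketch is the compression the reviewer highlights: every past (key, value) pair is folded into the fixed-size matrix `S`, which is exactly the "matrix-valued state" that Titans later replaces with an MLP.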

🧠 Neural Networks as Memory (3 insights)

MLP replaces matrix memory

Titans replaces the matrix-valued memory of linear transformers with a 2-layer MLP that functions as a neural network memory module queried for distant past information.

Test-time learning mechanism

The memory "learns at test time" by updating its parameters during inference to compress and store information from tokens as they exit the local context window, creating an inner learning loop.
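This inner learning loop can be sketched as a small numpy experiment. It is a minimal illustration, not the paper's exact algorithm: the sizes, the squared-error loss on a key/value pair, the ReLU activations, and the learning rate are all assumptions chosen to show the mechanism of a 2-layer MLP whose weights are updated by gradient descent during inference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16
W1 = rng.normal(scale=0.1, size=(h, d))  # memory parameters: the "state"
W2 = rng.normal(scale=0.1, size=(d, h))

def memory_read(k):
    # Query the memory: a plain 2-layer ReLU MLP.
    return W2 @ np.maximum(W1 @ k, 0.0)

def memory_write(k, v, lr=0.05):
    """One inner-loop SGD step on ||M(k) - v||^2 (manual backprop)."""
    global W1, W2
    a = W1 @ k
    hidden = np.maximum(a, 0.0)
    err = W2 @ hidden - v                 # prediction error ("surprise")
    dW2 = 2.0 * np.outer(err, hidden)
    dhidden = (W2.T @ (2.0 * err)) * (a > 0)
    dW1 = np.outer(dhidden, k)
    W1 -= lr * dW1
    W2 -= lr * dW2
    return float(err @ err)

# As a token exits the local window, its (key, value) pair is written
# into the memory by repeated gradient steps; the loss shrinks as the
# memory absorbs the association.
k, v = rng.normal(size=(2, d))
losses = [memory_write(k, v) for _ in range(50)]
print(losses[0], losses[-1])
```

Afterwards `memory_read(k)` approximates `v`, which is the sense in which the MLP "memorizes" evicted tokens; the actual paper additionally uses momentum and a forgetting term in this update.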

Dual attention architecture

The model combines standard local attention with queries to the neural memory, allowing it to retrieve relevant information from arbitrarily long sequences beyond the immediate context window.
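A hypothetical sketch of combining the two information paths described above: a softmax attention read over the local window, plus a read from the long-term memory, merged here with a scalar gate. The gate value and the stand-in memory output are illustrative assumptions (the paper explores several composition variants, including memory-as-context).

```python
import numpy as np

rng = np.random.default_rng(0)
d, window = 4, 6
Q = rng.normal(size=d)             # current token's query
K = rng.normal(size=(window, d))   # keys of the local window
V = rng.normal(size=(window, d))   # values of the local window

# Standard scaled-dot-product attention over the local window only.
scores = K @ Q / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
local = weights @ V

# Stand-in for a read from the neural long-term memory (see above);
# in the real model this comes from querying the memory MLP.
memory_out = rng.normal(size=d)

gate = 0.7  # stand-in for a learned gate in [0, 1]
out = gate * local + (1.0 - gate) * memory_out
print(out.shape)  # (4,)
```

The design point is that the local path stays exact (full softmax attention) while anything beyond the window is only reachable through the compressed memory read.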

⚖️ Critical Assessment (2 insights)

Novelty versus marketing

The reviewer argues the paper presents a 50/50 split of genuine innovation and marketing, repackaging existing concepts like RNN hidden states and linear attention accumulation with new "memory" terminology.

Matrix compression debate

The paper's claim that matrix-valued states inherently limit performance is disputed; the limitation stems from poor kernel approximations rather than the compression itself.

Bottom Line

The key innovation is using a neural network with test-time parameter updates as a memory module, offering a flexible alternative to fixed matrix states for handling arbitrarily long contexts.
