Titans: Learning to Memorize at Test Time (Paper Analysis)
TL;DR
This analysis of Google's Titans paper examines an architecture that extends effective context length by using a 2-layer MLP as a neural memory module, one that learns at test time to compress and retrieve long-range information. The reviewer finds genuine innovation in the adaptive memory while noting that parts of the design repackage existing linear attention concepts.
📏 The Long Context Challenge
Transformer context limitations
Standard transformers can only attend within fixed context windows, making them unable to process very long sequences like videos or extended documents without losing information from earlier segments.
Previous compression methods
Earlier approaches like Transformer-XL passed compressed hidden states between chunks to act as memory, while linear transformers used kernel tricks to accumulate keys and values into a matrix-valued state that can be updated and queried in constant time per token.
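The linear-transformer accumulation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the feature map `phi` (an ELU+1-style positive map, common in linear attention work) and all dimensions are illustrative choices.

```python
import numpy as np

def phi(x):
    # A simple positive feature map (ELU + 1 style), standing in for the
    # kernel approximation of softmax used by linear transformers.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(queries, keys, values):
    """Accumulate keys/values into a single matrix-valued state S and a
    normalizer z, then read the state out per query -- O(n) in sequence
    length, with constant-size memory regardless of how long the past is."""
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_k, d_v))   # matrix-valued "memory" state
    z = np.zeros(d_k)          # running normalizer
    outputs = []
    for q, k, v in zip(queries, keys, values):
        S += np.outer(phi(k), v)                          # write: rank-1 update
        z += phi(k)
        outputs.append(phi(q) @ S / (phi(q) @ z + 1e-6))  # read
    return np.array(outputs)
```

The key point for the Titans discussion: all past tokens are squeezed into the fixed-size matrix `S`, which is exactly the state that Titans replaces with a small neural network.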
🧠 Neural Networks as Memory
MLP replaces matrix memory
Titans replaces the matrix-valued memory of linear transformers with a 2-layer MLP that acts as a neural memory module, queried to retrieve information from the distant past.
Test-time learning mechanism
The memory "learns at test time" by updating its parameters during inference to compress and store information from tokens as they exit the local context window, creating an inner learning loop.
Dual attention architecture
The model combines standard local attention with queries to the neural memory, allowing it to retrieve relevant information from arbitrarily long sequences beyond the immediate context window.
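The combination of local attention and memory retrieval can be sketched as follows. The combination rule here (simple concatenation of the two read-outs) is an illustrative stand-in; the paper proposes several variants for wiring memory into the backbone, such as gating or treating memory output as extra context tokens.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_with_memory(query, window_keys, window_values, memory_read):
    """Answer a query from two sources: standard softmax attention over
    the local window, plus a read from the neural memory covering
    everything that has already scrolled out of the window."""
    scores = window_keys @ query / np.sqrt(query.shape[0])
    local = softmax(scores) @ window_values   # short-range attention
    distant = memory_read(query)              # long-range retrieval
    # Illustrative merge: downstream layers would gate or project this.
    return np.concatenate([local, distant])
```

Because `memory_read` has fixed cost regardless of sequence length, the overall per-token cost stays bounded by the window size while information from arbitrarily far back remains reachable.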
⚖️ Critical Assessment
Novelty versus marketing
The reviewer argues the paper presents a 50/50 split of genuine innovation and marketing, repackaging existing concepts like RNN hidden states and linear attention accumulation with new "memory" terminology.
Matrix compression debate
The paper's claim that matrix-valued states inherently limit performance is disputed; the reviewer argues the limitation stems from poor kernel approximations of softmax attention rather than from the compression itself.
Bottom Line
The key innovation is using a neural network with test-time parameter updates as a memory module, offering a flexible alternative to fixed matrix states for handling arbitrarily long contexts.
More from Yannic Kilcher
Traditional X-Mas Stream
While streaming Minecraft gameplay, ML researcher Yannic Kilcher discusses how recursive self-improvement in AI faces practical exploration limits similar to reinforcement learning, and notes the field's shift from fundamental research to market-driven product development focused on coding and image generation applications.
TiDAR: Think in Diffusion, Talk in Autoregression (Paper Analysis)
TiDAR accelerates autoregressive LLM inference by utilizing idle GPU capacity during memory-bound phases to pre-draft future tokens via diffusion, then verifying them through autoregressive rejection sampling to maintain exact output quality without auxiliary model overhead.
[Paper Analysis] The Free Transformer (and some Variational Autoencoder stuff)
The Free Transformer extends decoder architectures by introducing latent variables at the start of generation to capture global sequence decisions (like sentiment), replacing the implicit inference required by standard token-level sampling with explicit conditioning that simplifies learning and improves coherence.
More in AI & Machine Learning
This picture broke my brain
This video unpacks M.C. Escher's "Print Gallery" lithograph, revealing how its paradoxical infinite loop relies on a conformal grid derived from complex analysis to transform a linear Droste effect into a continuous circular zoom, mathematically resolving the mysterious blank center.