Measuring Exponential Trends Rising (in AI) — Joel Becker, METR
TL;DR
METR (Model Evaluation and Threat Research) tracks AI capabilities using a 'time horizon' framework. The data show remarkably consistent exponential growth in the difficulty of tasks models can solve, with recent jumps such as Claude Opus 4.5, and highlight the critical distinction between human-equivalent task difficulty and actual autonomous runtime.
🎯 METR's Dual Mission
Capabilities and propensities evaluation
METR assesses what AI models can do (capabilities) and what they actually will do in deployment (propensities) to determine whether models pose catastrophic or existential risks to society.
Threat model evolution
The organization has shifted focus from autonomous replication threats (models self-replicating in the wild) toward R&D acceleration risks (capability explosions inside labs that could destabilize global security).
📈 The Time Horizon Framework
Exponential trend discovery
METR's chart plotting task difficulty (measured in human hours required for 50% reliability) against time shows a remarkably straight exponential trend across multiple years and orders of magnitude of compute.
Human hours as difficulty metric
The metric represents task complexity via human-equivalent time rather than autonomous runtime, meaning models may complete 30-hour human tasks in minutes, and the chart tracks capability ceilings rather than execution speed.
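The doubling time implied by a straight exponential trend can be read off with a simple log-linear fit. A minimal sketch below, where the (year, horizon-in-minutes) points are synthetic placeholders constructed for the demo, not METR's published data:

```python
# Sketch: estimate the doubling time of a time-horizon trend.
# The data points are SYNTHETIC, not METR's actual measurements.
import math

def doubling_time_years(points):
    """OLS fit of log2(horizon) against calendar year; 1/slope = years per doubling."""
    n = len(points)
    xs = [year for year, _ in points]
    ys = [math.log2(h) for _, h in points]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1.0 / slope

# Synthetic horizons (minutes) that double every 0.6 years:
pts = [(2022.0, 1.0), (2022.6, 2.0), (2023.2, 4.0), (2023.8, 8.0)]
print(round(doubling_time_years(pts), 3))  # 0.6
```

On a clean exponential the fit recovers the doubling time exactly; on real, noisy horizon data the same least-squares fit gives the best-fit doubling time.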
🧪 Task Design and Constraints
170-task evaluation suite
The benchmark suite spans atomic software actions, multi-hour autonomous challenges (HCAST), and novel ML research engineering (RE-bench), selected for economic relevance and automatic gradability.
Systematic exclusions
Tasks requiring computer vision, external real-world interactions ("messy" tasks), or implicit background knowledge not provided in prompts are excluded, potentially understating capabilities in unstructured deployment scenarios.
🚀 Recent Capability Acceleration
Opus 4.5 discontinuity
Claude Opus 4.5 represented a significant upward break in the previously continuous time-horizon trend, with capability improvements striking enough to convert previously skeptical engineers into heavy users of AI for coding.
Autonomy claims vs. evidence
While anecdotal reports of models running autonomously for hours proliferate, METR emphasizes these often lack scientific controls for cherry-picking and success thresholds, unlike their standardized 50% reliability evaluations.
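The 50%-reliability threshold behind the time-horizon metric can be illustrated concretely: given success rates measured on tasks of increasing human-time length, the horizon is the task length at which success crosses 50%. A minimal sketch using log-linear interpolation between measured points (the success rates below are invented for illustration; METR's actual methodology may differ, e.g. by fitting a logistic curve):

```python
# Sketch: find the task length where success rate crosses 50%.
# Success-rate data is INVENTED for illustration.
import math

def horizon_at_50(task_minutes, success_rates):
    """Log-linear interpolation of the task length where success = 0.5."""
    pairs = list(zip(task_minutes, success_rates))
    for (t0, s0), (t1, s1) in zip(pairs, pairs[1:]):
        if s0 >= 0.5 >= s1:  # success falls through 50% on this segment
            frac = (s0 - 0.5) / (s0 - s1)
            return math.exp(math.log(t0) + frac * (math.log(t1) - math.log(t0)))
    raise ValueError("success never crosses 50%")

# Invented success rates on tasks from 1 minute to 8 hours:
mins = [1, 4, 15, 60, 240, 480]
rates = [0.98, 0.95, 0.85, 0.60, 0.40, 0.20]
print(round(horizon_at_50(mins, rates)))  # 120
```

Here the invented model succeeds 60% of the time on 1-hour tasks and 40% on 4-hour tasks, so its interpolated 50% horizon lands at the geometric midpoint, 2 hours. Standardizing on this threshold is what lets METR compare models on one curve, unlike anecdotal autonomy claims.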
Bottom Line
Organizations should prepare for exponential growth in AI task complexity but distinguish carefully between benchmark performance and reliable real-world autonomy, while prioritizing governance frameworks for near-term R&D acceleration risks over distant autonomous replication scenarios.