Measuring Exponential Trends Rising (in AI) — Joel Becker, METR
TL;DR
METR (Model Evaluation and Threat Research) tracks AI capabilities using a 'time horizon' framework. The data show remarkably consistent exponential growth in the difficulty of tasks models can solve, with recent jumps such as Claude Opus 4.5, and highlight the critical distinction between human-equivalent task difficulty and actual autonomous runtime.
🎯 METR's Dual Mission
Capabilities and propensities evaluation
METR assesses what AI models can do (capabilities) and what they actually will do in deployment (propensities) to determine whether models pose catastrophic or existential risks to society.
Threat model evolution
The organization has shifted focus from autonomous replication threats (models self-replicating in the wild) toward R&D acceleration risks (capability explosions inside labs that could destabilize global security).
📈 The Time Horizon Framework
Exponential trend discovery
METR's chart plotting task difficulty (measured in human hours required for 50% reliability) against time shows a remarkably straight exponential trend across multiple years and orders of magnitude of compute.
Human hours as difficulty metric
The metric represents task complexity via human-equivalent time rather than autonomous runtime, meaning models may complete 30-hour human tasks in minutes, and the chart tracks capability ceilings rather than execution speed.
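The doubling time implied by a straight exponential trend can be read off with a simple log-linear fit. A minimal sketch below, where the (year, horizon-in-minutes) points are synthetic placeholders constructed for the demo, not METR's published data:

```python
# Sketch: estimate the doubling time of a time-horizon trend.
# The data points are SYNTHETIC, not METR's actual measurements.
import math

def doubling_time_years(points):
    """OLS fit of log2(horizon) against calendar year; 1/slope = years per doubling."""
    n = len(points)
    xs = [year for year, _ in points]
    ys = [math.log2(h) for _, h in points]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1.0 / slope

# Synthetic horizons (minutes) that double every 0.6 years:
pts = [(2022.0, 1.0), (2022.6, 2.0), (2023.2, 4.0), (2023.8, 8.0)]
print(round(doubling_time_years(pts), 3))  # 0.6
```

On a clean exponential the fit recovers the doubling time exactly; on real, noisy horizon data the same least-squares fit gives the best-fit doubling time.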
🧪 Task Design and Constraints
170-task evaluation suite
The benchmark suite spans atomic software actions, multi-hour autonomous challenges (HCAST), and novel ML research engineering (RE-bench), selected for economic relevance and automatic gradability.
Systematic exclusions
Tasks requiring computer vision, external real-world interactions ("messy" tasks), or implicit background knowledge not provided in prompts are excluded, potentially understating capabilities in unstructured deployment scenarios.
🚀 Recent Capability Acceleration
Opus 4.5 discontinuity
Claude Opus 4.5 represented a significant upward break in the previously continuous time-horizon trend, with capability improvements striking enough to convert previously skeptical engineers into heavy users of AI for coding.
Autonomy claims vs. evidence
While anecdotal reports of models running autonomously for hours proliferate, METR emphasizes these often lack scientific controls for cherry-picking and success thresholds, unlike their standardized 50% reliability evaluations.
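The 50%-reliability threshold behind the time-horizon metric can be illustrated concretely: given success rates measured on tasks of increasing human-time length, the horizon is the task length at which success crosses 50%. A minimal sketch using log-linear interpolation between measured points (the success rates below are invented for illustration; METR's actual methodology may differ, e.g. by fitting a logistic curve):

```python
# Sketch: find the task length where success rate crosses 50%.
# Success-rate data is INVENTED for illustration.
import math

def horizon_at_50(task_minutes, success_rates):
    """Log-linear interpolation of the task length where success = 0.5."""
    pairs = list(zip(task_minutes, success_rates))
    for (t0, s0), (t1, s1) in zip(pairs, pairs[1:]):
        if s0 >= 0.5 >= s1:  # success falls through 50% on this segment
            frac = (s0 - 0.5) / (s0 - s1)
            return math.exp(math.log(t0) + frac * (math.log(t1) - math.log(t0)))
    raise ValueError("success never crosses 50%")

# Invented success rates on tasks from 1 minute to 8 hours:
mins = [1, 4, 15, 60, 240, 480]
rates = [0.98, 0.95, 0.85, 0.60, 0.40, 0.20]
print(round(horizon_at_50(mins, rates)))  # 120
```

Here the invented model succeeds 60% of the time on 1-hour tasks and 40% on 4-hour tasks, so its interpolated 50% horizon lands at the geometric midpoint, 2 hours. Standardizing on this threshold is what lets METR compare models on one curve, unlike anecdotal autonomy claims.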
Bottom Line
Organizations should prepare for exponential growth in AI task complexity but distinguish carefully between benchmark performance and reliable real-world autonomy, while prioritizing governance frameworks for near-term R&D acceleration risks over distant autonomous replication scenarios.