How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR
TL;DR
Joel Becker of METR argues that slower compute growth would proportionally delay AI capability milestones measured by task time horizons. He also presents findings that experienced open-source developers showed minimal productivity gains from AI coding assistants like Cursor, which challenges optimistic adoption curves.
📈 Compute Scaling & AI Timelines 3 insights
Compute-time horizon proportionality causes milestone delays
If compute growth slows by half, time horizon growth slows by the same proportion, which could add years to the wait for AI milestones such as automating one-month tasks.
Physical and economic constraints threaten compute growth
Power constraints and spending limits for large tech companies and nation states may bend the compute curve downward after 2030, directly impacting capability advancement speed.
Proportionality holds absent software-only singularity
This causal relationship between compute and time horizons persists only until a software singularity or unpredictable architectural breakthrough decouples software improvements from hardware scaling.
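The proportionality argument above can be made concrete with a little arithmetic. The sketch below assumes that time horizons grow exponentially with a fixed doubling time, and that halving compute growth doubles that doubling time. The starting horizon (2 hours), the target (128 hours, a rough stand-in for a one-month task), and the 7-month doubling time are illustrative placeholders, not figures from the talk.

```python
import math


def months_to_reach(target_hours, current_hours, doubling_months):
    """Months until the time horizon grows from current_hours to target_hours,
    assuming it doubles every doubling_months (an exponential-growth assumption)."""
    doublings_needed = math.log2(target_hours / current_hours)
    return doublings_needed * doubling_months


def slowed_doubling_time(doubling_months, compute_growth_factor):
    """Doubling time after compute growth is scaled by compute_growth_factor,
    assuming horizon growth is proportional to compute growth
    (a factor of 0.5 means compute grows half as fast)."""
    return doubling_months / compute_growth_factor


# Hypothetical inputs chosen only for illustration.
baseline = months_to_reach(target_hours=128, current_hours=2, doubling_months=7)
slowed = months_to_reach(
    target_hours=128,
    current_hours=2,
    doubling_months=slowed_doubling_time(7, 0.5),
)
delay = slowed - baseline
print(f"baseline: {baseline:.0f} months, slowed: {slowed:.0f} months, delay: {delay:.0f} months")
```

Under these assumptions the delay equals the full baseline duration: halving the growth rate doubles the time to any fixed milestone, so milestones that are further away are pushed back by more.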
💻 Developer Productivity Findings 3 insights
Experienced developers show negligible Cursor speedup
A study of 16 experienced open-source developers using Cursor found minimal productivity gains, contradicting assumptions that AI tools automatically accelerate professional workflows.
Self-reported time estimates prove consistently unreliable
Developers consistently misestimate the absolute time they spend on tasks, even when their sense of relative productivity is accurate, which makes time-based surveys unreliable for capability forecasting.
Familiarity with tools shows minimal explanatory power
While Meta observed a J-curve with AI tool adoption, METR found no evidence that Cursor familiarity explained the null results among developers already experienced with LLMs.
🏗️ Evaluation Context & Limitations 3 insights
AI excels on legacy over open-source code
AI assistants demonstrate greater utility on disorganized legacy codebases lacking documentation compared to well-structured open-source projects optimized for human navigation.
Doubling time horizons break evaluation feasibility
As AI time horizons double, evaluation tasks eventually exceed feasible human monitoring periods, potentially breaking the metric's usefulness before maximum capabilities are reached.
Capability constraints outweigh human learning curves
The barrier to developer speedup appears rooted in fundamental AI capability limits rather than temporary human adoption friction or suboptimal prompting strategies.
Bottom Line
AI capability forecasting must account for compute constraints that could proportionally delay long-horizon task automation. Meanwhile, current evidence suggests that experienced developers are held back by fundamental capability limits in AI coding tools rather than by temporary adoption friction.