Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith
TL;DR
Artificial Analysis co-founders George Cameron and Micah Hill-Smith describe how the company grew from a side project into the presumptive independent standard for LLM evaluation. They explain how they maintain objectivity through 'mystery shopper' protocols and how the market has shifted from OpenAI dominance to a fragmented, globally competitive landscape.
🌱 Building an Independent Benchmarking Business
From developer side project to sustainable enterprise
Started in 2023 while the founders were building a legal AI assistant and needed data on quality-cost-speed tradeoffs. The project initially ran on personal funds, with compute costs in the hundreds of dollars, and has since grown to 20+ employees funded by enterprise subscriptions and private benchmarking services.
Strict independence through dual revenue streams
No lab pays to appear on the public website; revenue comes instead from enterprise advisory subscriptions (covering deployment decisions like serverless vs. self-hosted) and private benchmarking for AI companies testing their own models.
The 'mystery shopper' integrity protocol
To prevent labs from manipulating results when providing private API endpoints, they register anonymous accounts on non-company domains to verify that public endpoints perform identically to preview versions, ensuring no special treatment.
🔬 The Science of Evaluation
Controlling for benchmark gaming
Discovered labs used inconsistent prompting methodologies—such as Google using 32 unpublished chain-of-thought examples for Gemini Ultra to beat GPT-4 on MMLU—necessitating in-house standardized evaluation across all models.
Statistical rigor drives non-linear cost growth
Moved beyond single-run evaluations to repeated runs that yield 95% confidence intervals of roughly ±1 point. This contributed to non-linear cost growth as the number of models and the complexity of evaluations increased, well past the initial budget of hundreds of dollars.
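The link between repeated runs and cost can be made concrete with a small sketch. It assumes a normal approximation for the interval and treats each run's overall score as one sample; the function names are illustrative and not taken from Artificial Analysis's tooling.

```python
import math
import statistics


def mean_with_ci(scores, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Normal approximation: half_width = z * s / sqrt(n), where s is the
    sample standard deviation across repeated runs. With very few runs a
    t-distribution critical value would be more appropriate than z = 1.96.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, z * sem


def runs_needed(stdev, target_half_width=1.0, z=1.96):
    """Smallest number of runs whose CI half-width is at most the target."""
    return math.ceil((z * stdev / target_half_width) ** 2)
```

Under these assumptions, a run-to-run standard deviation of 2 points needs 16 runs for a ±1 point interval and 62 runs for ±0.5, because the required number of runs scales with the square of the desired precision. That is one plausible reason evaluation cost grows faster than linearly.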
Addressing evaluation bias systematically
Uses answer extraction that parses model outputs so that correct reasoning is not penalized for formatting errors, randomizes multiple-choice answer order to eliminate position bias, and controls for temperature variance, which can cause large score swings on small question sets.
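Two of these controls, option shuffling and tolerant answer extraction, can be sketched in a few lines. This is a minimal illustration, assuming four-option questions labeled A to D. The regular expressions and function names are hypothetical and are not the patterns Artificial Analysis actually uses.

```python
import random
import re

LETTERS = "ABCD"


def shuffle_options(options, correct_index, seed):
    """Shuffle answer options; return (shuffled_options, new_correct_letter).

    A per-question seed keeps the shuffle reproducible across models.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[j] is the original index now shown at position j.
    return shuffled, LETTERS[order.index(correct_index)]


# Patterns are tried in order, from most to least explicit.
_PATTERNS = [
    # "The answer is B", "Answer: (C)"; only the keyword is case-insensitive.
    re.compile(r"(?i:answer\s*(?:is|:)?\s*)\(?([A-D])\)?\b"),
    # A line containing nothing but a letter, e.g. "D", "(D)" or "D."
    re.compile(r"^\s*\(?([A-D])\)?[.)]?\s*$", re.MULTILINE),
]


def extract_answer(text):
    """Return the last letter A-D found by the first matching pattern, or None."""
    for pattern in _PATTERNS:
        matches = pattern.findall(text)
        if matches:
            return matches[-1]
    return None
```

Taking the last match rather than the first is a design choice: a model that reasons step by step may mention several letters before committing to a final answer. Returning None for unparseable output lets the harness count extraction failures separately from wrong answers.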
📊 The Democratizing AI Frontier
End of the OpenAI monopoly
Market structure shifted from OpenAI's year-long untouchable dominance to intense multi-polar competition, with DeepSeek's V3 release (Boxing Day 2024) proving non-US labs could reach frontier capabilities, followed weeks later by R1's reasoning breakthrough.
Benchmark saturation cycles
The Intelligence Index evolved from V1 through V3 because tasks like HumanEval became trivial for modern models; the focus is now shifting to agentic capabilities, long-context reasoning, hallucination detection, and economically valuable use cases rather than academic Q&A.
Quality-per-dollar revolution
Progress in model efficiency means even small models today solve 100% of problems that required frontier models two years ago, driving down intelligence costs across all tiers while making raw capability scores less meaningful than task-specific performance.
Bottom Line
Treat vendor-reported benchmarks with extreme skepticism; decisions about AI infrastructure should rely on independent evaluation measuring the complete triangle of capability, latency, and cost under identical testing conditions, particularly as the market fragments beyond US frontier labs.
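One way to read the capability, latency, and cost triangle is as a Pareto comparison: a model is worth shortlisting only if no other model is at least as good on all three axes and strictly better on one. The sketch below assumes output speed in tokens per second as a stand-in for latency. The model names and numbers are made-up placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    quality: float         # benchmark score, higher is better
    tokens_per_sec: float  # output speed, higher is better
    usd_per_mtok: float    # price per million tokens, lower is better


def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least = (a.quality >= b.quality
                and a.tokens_per_sec >= b.tokens_per_sec
                and a.usd_per_mtok <= b.usd_per_mtok)
    strictly = (a.quality > b.quality
                or a.tokens_per_sec > b.tokens_per_sec
                or a.usd_per_mtok < b.usd_per_mtok)
    return at_least and strictly


def pareto_frontier(results):
    """Keep only the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Placeholder values for illustration only.
SAMPLE = [
    ModelResult("model-a", quality=80, tokens_per_sec=50, usd_per_mtok=10.0),
    ModelResult("model-b", quality=75, tokens_per_sec=120, usd_per_mtok=2.0),
    ModelResult("model-c", quality=70, tokens_per_sec=100, usd_per_mtok=3.0),
    ModelResult("model-d", quality=60, tokens_per_sec=200, usd_per_mtok=0.5),
]
```

With these placeholder values, `pareto_frontier(SAMPLE)` drops model-c, which model-b beats on every axis, and keeps the other three as genuine tradeoffs. The comparison is only meaningful if all three measurements come from the same testing conditions.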