Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

| Podcasts | January 09, 2026 | 3.91 Thousand views | 1:18:15

TL;DR

Artificial Analysis co-founders George Cameron and Micah Hill-Smith detail their journey from a side project to becoming the presumptive independent standard for LLM evaluation, revealing how they maintain objectivity through 'mystery shopper' protocols while navigating the shift from an OpenAI-dominated market to a fragmented, globally competitive landscape.

🌱 Building an Independent Benchmarking Business 3 insights

From developer side project to sustainable enterprise

The project started in 2023, while the founders were building a legal AI assistant, to meet their own need for quality-cost-speed tradeoff data. It initially ran on personal funds, with compute costs in the hundreds of dollars, and has since grown to 20+ employees funded by enterprise subscriptions and private benchmarking services.

Strict independence through dual revenue streams

No lab pays to appear on the public website; revenue comes instead from enterprise advisory subscriptions (covering deployment decisions like serverless vs. self-hosted) and private benchmarking for AI companies testing their own models.

The 'mystery shopper' integrity protocol

To prevent labs from manipulating results when providing private API endpoints, they register anonymous accounts on non-company domains to verify that public endpoints perform identically to preview versions, ensuring no special treatment.

🔬 The Science of Evaluation 3 insights

Controlling for benchmark gaming

Discovered labs used inconsistent prompting methodologies—such as Google using 32 unpublished chain-of-thought examples for Gemini Ultra to beat GPT-4 on MMLU—necessitating in-house standardized evaluation across all models.

Statistical rigor drives exponential costs

They moved beyond single-run evaluations to repeated runs that yield 95% confidence intervals of about ±1 point. This contributed to non-linear cost increases as the number of models and the complexity of evaluations grew well past the initial 'hundreds of dollars' budget.
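As a rough illustration of why tighter intervals cost more, the sketch below computes a 95% confidence interval from repeated runs and estimates how many runs a ±1 point target would need. It is not Artificial Analysis's method: the scores are hypothetical, the function names are made up for this example, and it uses a normal approximation (a t-distribution would be more appropriate for very few runs).

```python
import math
import statistics


def confidence_interval_95(scores):
    """Return (mean, half_width) of a normal-approximation 95% CI for the mean."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(n)
    return mean, 1.96 * stderr


def runs_needed(run_stdev, target_half_width=1.0):
    """Smallest number of runs whose 95% CI half-width is at most the target."""
    return math.ceil((1.96 * run_stdev / target_half_width) ** 2)


# Hypothetical scores (in points) from five repetitions of one evaluation.
scores = [71.0, 73.0, 72.0, 74.0, 70.0]
mean, half_width = confidence_interval_95(scores)
print(f"{mean:.1f} ± {half_width:.2f}")

# Runs required for ±1 point and for ±0.5 point at the same run-to-run spread.
spread = statistics.stdev(scores)
print(runs_needed(spread, 1.0), runs_needed(spread, 0.5))
```

Because the half-width shrinks with the square root of the number of runs, halving the interval takes roughly four times as many runs, which is one reason evaluation cost does not scale linearly with precision.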

Addressing evaluation bias systematically

Their pipeline uses robust answer extraction so that a model is not penalized for formatting errors when its reasoning is correct, randomizes multiple-choice answer order to remove position bias, and controls for temperature variance, which can cause large score swings on small question sets.
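Two of these controls can be sketched in a few lines. The example below is a minimal illustration, not the pipeline described in the episode: the regular expressions, the four-option `LETTERS` constant, and the function names are all assumptions chosen for this sketch.

```python
import random
import re

LETTERS = "ABCD"  # assumes four-option questions


def shuffle_options(options, correct_index, rng):
    """Shuffle answer options; return (shuffled_options, new_correct_letter)."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option now sits wherever its old index landed in `order`.
    return shuffled, LETTERS[order.index(correct_index)]


# Patterns tried in order, from most to least explicit.
ANSWER_PATTERNS = [
    # "answer is (C)", "Answer: C" (the keyword is case-insensitive, the letter is not)
    re.compile(r"(?i:answer\s*(?:is|:)?)\s*\(?([A-D])\b"),
    # A line containing only a letter, e.g. "B", "(B)" or "B."
    re.compile(r"^\s*\(?([A-D])\)?[.)]?\s*$", re.MULTILINE),
]


def extract_answer(text):
    """Return the last letter matched by the first pattern that hits, or None."""
    for pattern in ANSWER_PATTERNS:
        matches = pattern.findall(text)
        if matches:
            return matches[-1]
    return None


rng = random.Random(0)  # fixed seed so the shuffle is reproducible
options = ["Paris", "Rome", "Berlin", "Madrid"]
shuffled, correct_letter = shuffle_options(options, 0, rng)
print(shuffled, correct_letter)
print(extract_answer("Option A looks wrong, so the answer is (C)."))
```

Taking the last match rather than the first lets a model mention other options while reasoning and still be graded on its final answer. Returning `None` for unparseable output keeps extraction failures visible instead of silently scoring them as wrong.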

📊 The Democratizing AI Frontier 3 insights

End of the OpenAI monopoly

Market structure shifted from OpenAI's year-long untouchable dominance to intense multi-polar competition, with DeepSeek's V3 release (Boxing Day 2024) proving non-US labs could reach frontier capabilities, followed weeks later by R1's reasoning breakthrough.

Benchmark saturation cycles

The Intelligence Index evolved from V1 through V3 because tasks like HumanEval became trivial for modern models. The current focus is shifting to agentic capabilities, long-context reasoning, hallucination detection, and economically valuable use cases rather than academic Q&A.

Quality-per-dollar revolution

Progress in model efficiency means even small models today solve 100% of problems that required frontier models two years ago, driving down intelligence costs across all tiers while making raw capability scores less meaningful than task-specific performance.

Bottom Line

Treat vendor-reported benchmarks with extreme skepticism; decisions about AI infrastructure should rely on independent evaluation measuring the complete triangle of capability, latency, and cost under identical testing conditions, particularly as the market fragments beyond US frontier labs.
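One way to reason about the capability-latency-cost triangle is to discard any option that another option beats on all three axes at once. The sketch below shows that Pareto filter under stated assumptions: the model names and numbers are hypothetical placeholders, not measurements from Artificial Analysis, and the `Candidate` type is invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float    # higher is better
    latency_s: float  # lower is better
    cost_usd: float   # lower is better


def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.quality >= b.quality
                and a.latency_s <= b.latency_s
                and a.cost_usd <= b.cost_usd)
    better = (a.quality > b.quality
              or a.latency_s < b.latency_s
              or a.cost_usd < b.cost_usd)
    return no_worse and better


def pareto_frontier(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical numbers, all assumed to be measured under identical conditions.
candidates = [
    Candidate("model-a", quality=60.0, latency_s=2.0, cost_usd=10.0),
    Candidate("model-b", quality=55.0, latency_s=0.5, cost_usd=1.0),
    Candidate("model-c", quality=50.0, latency_s=1.0, cost_usd=2.0),
    Candidate("model-d", quality=40.0, latency_s=0.3, cost_usd=0.2),
]
frontier = pareto_frontier(candidates)
print([c.name for c in frontier])
```

Here `model-c` drops out because `model-b` is better on every axis, while the remaining three each represent a different tradeoff. The filter is only meaningful when all three measurements come from the same testing conditions, which is the point of relying on independent evaluation.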
