Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

Latent Space

| Podcasts | January 09, 2026 | 3.91 Thousand views | 1:18:15

TL;DR

Artificial Analysis co-founders George Cameron and Micah Hill-Smith describe how the company grew from a side project into the presumptive independent standard for LLM evaluation. They explain how 'mystery shopper' protocols keep their results objective, and how the market has shifted from OpenAI dominance to a fragmented, globally competitive landscape.

🌱 Building an Independent Benchmarking Business 3 insights

From developer side project to sustainable enterprise

The founders started Artificial Analysis in 2023 while building a legal AI assistant, to meet their own need for data on quality-cost-speed tradeoffs. It initially ran on personal funds, with compute costs in the hundreds of dollars, and has since grown to 20+ employees funded by enterprise subscriptions and private benchmarking services.

Strict independence through dual revenue streams

No lab pays to appear on the public website; revenue comes instead from enterprise advisory subscriptions (covering deployment decisions like serverless vs. self-hosted) and private benchmarking for AI companies testing their own models.

The 'mystery shopper' integrity protocol

To prevent labs from manipulating results when providing private API endpoints, they register anonymous accounts on non-company domains to verify that public endpoints perform identically to preview versions, ensuring no special treatment.

🔬 The Science of Evaluation 3 insights

Controlling for benchmark gaming

The team found that labs used inconsistent prompting methodologies, such as Google using 32 unpublished chain-of-thought examples for Gemini Ultra to beat GPT-4 on MMLU. That made in-house standardized evaluation across all models necessary.

Statistical rigor drives exponential costs

They moved beyond single-run evaluations to multiple repetitions, targeting 95% confidence intervals with roughly ±1 point precision. This contributed to non-linear cost increases as the number of models and the complexity of evaluations grew well beyond the initial 'hundreds of dollars' budget.
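As a rough illustration of why a ±1 point target multiplies cost, the sketch below computes a 95% confidence interval from repeated runs and estimates how many runs a given precision needs. It is a minimal sketch using a normal approximation (1.96 standard errors), not Artificial Analysis's actual method; the function names `mean_ci95` and `runs_needed` and the example scores are hypothetical.

```python
import math
import statistics

def mean_ci95(scores):
    """Return (mean, half_width) of a normal-approximation 95% CI for the mean score."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, 1.96 * sem

def runs_needed(run_stdev, target_half_width=1.0):
    """Smallest number of runs whose 95% half-width is at most the target.

    run_stdev is an assumed run-to-run standard deviation, in score points.
    """
    return max(2, math.ceil((1.96 * run_stdev / target_half_width) ** 2))

# Hypothetical scores from five repetitions of one evaluation.
mean, half_width = mean_ci95([70, 72, 71, 69, 73])
print(f"{mean:.1f} +/- {half_width:.2f}")
print(runs_needed(2.0))  # runs needed for +/-1 point if runs vary by 2 points
```

Because the required number of runs grows with the square of the run-to-run spread, halving the interval width costs about four times as many runs. With only a handful of runs, a Student's t multiplier would give a wider and more honest interval than 1.96.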

Addressing evaluation bias systematically

They use answer extraction that parses model outputs so that correct reasoning is not penalized for formatting errors, randomize multiple-choice answer order to eliminate position bias, and control for temperature variance, which can create large score swings on small question sets.
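Two of these controls can be sketched in a few lines: shuffling multiple-choice options with a seeded generator while tracking where the correct answer lands, and extracting the chosen letter from free-form output. This is an illustrative sketch only; the helper names, the four-option `LETTERS` constant, and the regular expressions are assumptions, and a production extractor would need many more patterns.

```python
import random
import re

LETTERS = "ABCD"  # assumes four-option questions

def shuffle_options(options, correct_index, seed):
    """Shuffle answer options; return (shuffled_options, letter_of_correct_answer)."""
    rng = random.Random(seed)  # seeded so every model sees the same order
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[j] is the original index now shown at position j.
    return shuffled, LETTERS[order.index(correct_index)]

# Patterns are tried in order, from most explicit to least explicit.
_PATTERNS = [
    re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-D])\b"),
    re.compile(r"\b([A-D])\b"),  # fallback: any standalone capital letter
]

def extract_answer(text):
    """Return the last answer letter found in the model output, or None."""
    for pattern in _PATTERNS:
        matches = pattern.findall(text)
        if matches:
            return matches[-1]
    return None

options = ["Paris", "Rome", "Madrid", "Berlin"]
shuffled, correct_letter = shuffle_options(options, correct_index=0, seed=7)
reply = f"Thinking it through step by step, the answer is ({correct_letter})."
print(extract_answer(reply) == correct_letter)
```

Taking the last match rather than the first is a deliberate choice here: models that reason before answering often mention several options before committing to one.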

📊 The Democratizing AI Frontier 3 insights

End of the OpenAI monopoly

Market structure shifted from OpenAI's year-long untouchable dominance to intense multi-polar competition, with DeepSeek's V3 release (Boxing Day 2024) proving non-US labs could reach frontier capabilities, followed weeks later by R1's reasoning breakthrough.

Benchmark saturation cycles

The Intelligence Index evolved from V1 through V3 because tasks like HumanEval became trivial for modern models. The current focus is shifting to agentic capabilities, long-context reasoning, hallucination detection, and economically valuable use cases rather than academic Q&A.

Quality-per-dollar revolution

Progress in model efficiency means even small models today solve 100% of problems that required frontier models two years ago, driving down intelligence costs across all tiers while making raw capability scores less meaningful than task-specific performance.

Bottom Line

Treat vendor-reported benchmarks with extreme skepticism; decisions about AI infrastructure should rely on independent evaluation measuring the complete triangle of capability, latency, and cost under identical testing conditions, particularly as the market fragments beyond US frontier labs.
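One way to read the capability, latency, and cost triangle is as a Pareto comparison: a model is worth considering if no other model is at least as good on all three axes and strictly better on one. The sketch below assumes that all measurements come from the same test conditions. The `Measurement` fields, model names, and numbers are hypothetical placeholders, not real results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    name: str
    score: float         # capability score, higher is better
    latency_s: float     # response latency in seconds, lower is better
    usd_per_mtok: float  # price per million tokens, lower is better

def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least = (a.score >= b.score
                and a.latency_s <= b.latency_s
                and a.usd_per_mtok <= b.usd_per_mtok)
    strictly = (a.score > b.score
                or a.latency_s < b.latency_s
                or a.usd_per_mtok < b.usd_per_mtok)
    return at_least and strictly

def pareto_front(items):
    """Keep only the measurements that no other measurement dominates."""
    return [m for m in items if not any(dominates(o, m) for o in items)]

# Hypothetical measurements, for illustration only.
models = [
    Measurement("model-a", 60.0, 1.2, 10.0),
    Measurement("model-b", 55.0, 0.4, 1.0),
    Measurement("model-c", 50.0, 0.9, 2.0),  # worse than model-b on all three axes
]
print([m.name for m in pareto_front(models)])  # ['model-a', 'model-b']
```

The frontier does not pick a winner. It only removes options that are never the right choice, leaving the tradeoff between the remaining ones to the specific workload.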
