Inference, Diffusion, World Models, and More | YC Paper Club

Y Combinator

| Business & Entrepreneurship | May 28, 2026 | 121 Thousand views | 1:07:19

TL;DR

At the inaugural YC Paper Club, Stanford researcher Tanishk presented Speculative Speculative Decoding (SSD), arguing that inference speed is becoming the primary constraint on AI capabilities rather than just a cost factor. The technique reaches approximately 300 tokens per second on Llama 3 70B by running the drafting and verification steps of speculative decoding in parallel: the draft model predicts the likely verification outcome and starts the next draft early, which hides drafting latency.

⚡ Inference as the Capability Bottleneck

Inference speed determines peak intelligence

As AI systems scale through test-time compute and reasoning (for example, models trained with RL), tokens per second becomes the hard ceiling on deliverable intelligence, not merely an operational cost.

RL compute exceeds pre-training

Reinforcement learning already surpasses pre-training compute requirements, and since RL fundamentally wraps inference, generation efficiency is now the critical path for advancement.

🔄 Speculative Speculative Decoding (SSD)

Parallelizing sequential dependencies

SSD eliminates the bottleneck where drafting must wait for verification by having the small draft model predict likely verification outcomes and begin drafting the next round while the large model verifies the current one.
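The overlap described above can be sketched with toy models. This is a minimal illustration under stated assumptions, not the presenter's implementation: `Model` is a hypothetical greedy next-token function, `guess_all_accepted` is a made-up predictor, and the pre-drafting loop runs sequentially here, whereas a real system would run it concurrently with verification on separate hardware.

```python
from typing import Callable, Iterable, Tuple

Token = int
Context = Tuple[Token, ...]
Model = Callable[[Context], Token]   # toy stand-in: greedy next-token function
Outcome = Tuple[int, Token]          # (number of accepted draft tokens, bonus token)


def draft_block(draft: Model, ctx: Context, k: int) -> Context:
    """Let the draft model propose k tokens autoregressively."""
    out: Context = ()
    for _ in range(k):
        out += (draft(ctx + out),)
    return out


def verify(target: Model, ctx: Context, proposal: Context) -> Outcome:
    """Greedy verification: accept the longest prefix the target agrees with,
    then emit the target's own next token as the bonus token."""
    n = 0
    while n < len(proposal) and target(ctx + proposal[:n]) == proposal[n]:
        n += 1
    return n, target(ctx + proposal[:n])


def guess_all_accepted(draft: Model, ctx: Context, proposal: Context) -> Iterable[Outcome]:
    """Hypothetical predictor: assume every draft token is accepted and the
    bonus token equals the draft model's own next prediction."""
    yield len(proposal), draft(ctx + proposal)


def ssd_generate(draft: Model, target: Model, prompt, k: int, rounds: int,
                 guess=guess_all_accepted):
    ctx: Context = tuple(prompt)
    proposal = draft_block(draft, ctx, k)
    hits = 0
    for _ in range(rounds):
        # Pre-draft the next round for each guessed verification outcome.
        # (Sequential in this sketch; meant to overlap with verify().)
        cache = {}
        for n, bonus in guess(draft, ctx, proposal):
            cache[(n, bonus)] = draft_block(draft, ctx + proposal[:n] + (bonus,), k)
        n, bonus = verify(target, ctx, proposal)
        ctx = ctx + proposal[:n] + (bonus,)
        if (n, bonus) in cache:
            proposal = cache[(n, bonus)]            # hit: next draft is already done
            hits += 1
        else:
            proposal = draft_block(draft, ctx, k)   # miss: draft after the fact
    return ctx, hits


# Toy models (assumptions for illustration only).
def toy_target(ctx: Context) -> Token:
    return (3 * sum(ctx) + len(ctx)) % 7


def toy_draft(ctx: Context) -> Token:
    # Agrees with the target except when the context length is a multiple of 5.
    return toy_target(ctx) if len(ctx) % 5 else (toy_target(ctx) + 1) % 7


tokens, cache_hits = ssd_generate(toy_draft, toy_target, [1, 2], k=3, rounds=6)
```

Because every emitted token is either a draft token the target agreed with or the target's own bonus token, `tokens` matches what plain greedy decoding with `toy_target` would produce. The guesses only affect how often the next draft is ready in time.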

Predicting the verifier's behavior

The algorithm predicts which tokens the large model will accept, including the bonus token, with 80-90% accuracy by analyzing the draft model's token distributions, so drafting latency can be hidden completely whenever the prediction holds.
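One way to picture a prediction built from the draft model's token distributions is the sketch below. The summary does not say how the talk's predictor works, so everything here is an assumption: each draft token is treated as accepted independently with probability equal to the draft model's confidence in it, and the bonus token is guessed to be the draft model's runner-up at the rejected position. The names `confidence`, `runner_up`, and `next_top` are hypothetical.

```python
from typing import List, Tuple


def prefix_length_distribution(confidence: List[float]) -> List[float]:
    """Return P(exactly n draft tokens accepted) for n = 0..k.

    Assumption: token i is accepted independently with probability
    confidence[i], the draft model's probability for its own choice.
    """
    probs = []
    survive = 1.0                          # P(all tokens before position i accepted)
    for q in confidence:
        probs.append(survive * (1.0 - q))  # first rejection happens here
        survive *= q
    probs.append(survive)                  # every draft token accepted
    return probs


def most_likely_outcome(confidence: List[float], runner_up: List[int],
                        next_top: int) -> Tuple[int, int]:
    """Guess (accepted_length, bonus_token).

    runner_up[i]: draft model's second choice at position i, used as the
                  guessed bonus token if the first rejection is at i.
    next_top:     draft model's top token after the full block, used as the
                  guessed bonus token if every draft token is accepted.
    """
    probs = prefix_length_distribution(confidence)
    n = max(range(len(probs)), key=probs.__getitem__)
    bonus = next_top if n == len(confidence) else runner_up[n]
    return n, bonus


# A confident two-token draft: the full block is the most likely outcome.
guess = most_likely_outcome([0.9, 0.8], runner_up=[11, 12], next_top=13)
```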

Intelligent cache miss handling

Rather than naively falling back to standard speculation on mispredictions, SSD optimizes compute allocation across plausible prefix lengths to maximize hit rates and maintain speed.
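The allocation idea can be illustrated with a small budgeted selection. This is a sketch under assumptions rather than the method from the talk: it supposes an estimate of how likely each accepted-prefix length is, plus a ranked list of bonus-token probabilities per length, and it spends a fixed number of pre-draft slots on the outcomes with the highest joint probability. The function name `allocate_budget` and the example numbers are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def allocate_budget(length_prob: List[float],
                    bonus_prob: List[List[float]],
                    budget: int) -> Tuple[Dict[int, int], float]:
    """Choose which (prefix length, bonus candidate) outcomes to pre-draft.

    length_prob[n]: estimated probability that exactly n draft tokens are accepted.
    bonus_prob[n]:  estimated probabilities of the bonus-token candidates given
                    length n, sorted in descending order.
    Returns (slots assigned per prefix length, estimated cache hit rate).
    """
    # Joint probability of each candidate outcome.
    scored = [(length_prob[n] * p, n)
              for n, candidates in enumerate(bonus_prob)
              for p in candidates]
    # The hit rate is the sum of the selected outcomes' probabilities,
    # so taking the `budget` largest scores maximizes it.
    chosen = heapq.nlargest(budget, scored)
    slots: Dict[int, int] = {}
    for _, n in chosen:
        slots[n] = slots.get(n, 0) + 1
    return slots, sum(score for score, _ in chosen)


# Hypothetical estimates for a two-token draft (prefix lengths 0, 1, 2).
length_prob = [0.1, 0.2, 0.7]
bonus_prob = [[0.5, 0.3], [0.6, 0.2], [0.9, 0.05]]
slots, hit_rate = allocate_budget(length_prob, bonus_prob, budget=4)
```

With these made-up numbers, the fourth slot goes to a second bonus candidate for prefix length 1 rather than a second candidate for the most likely length, because the top candidate at length 2 already covers most of that length's probability mass.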

📊 Performance Benchmarks

300 tokens per second on large models

The SSD implementation achieves approximately 300 tokens per second on Llama 3 70B using only four H100s, significantly outperforming existing open-source inference engines like SGLang.

Throughput and latency improvements

Unlike vanilla speculative decoding, which primarily reduces latency, SSD improves both latency and throughput by fully overlapping draft and target model computation.
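A back-of-envelope timing model shows why overlap helps per-round latency. It is a simplification with assumed inputs: a correct prediction makes a round cost the slower of the two models, a wrong prediction costs draft plus verify as in ordinary speculative decoding, and both schemes emit the same number of tokens per round. The millisecond figures are illustrative, not measurements from the talk, and throughput effects from batching are not modeled.

```python
def sequential_round_ms(t_draft: float, t_verify: float) -> float:
    """Vanilla speculative decoding: draft, then verify."""
    return t_draft + t_verify


def overlapped_round_ms(t_draft: float, t_verify: float, hit_rate: float) -> float:
    """Expected round time when the next draft overlaps with verification.

    hit_rate: assumed probability that the pre-drafted continuation matches
              the actual verification outcome.
    """
    hit = max(t_draft, t_verify)   # drafting is hidden behind verification
    miss = t_draft + t_verify      # pre-draft is wasted; draft again afterwards
    return hit_rate * hit + (1.0 - hit_rate) * miss


def speedup(t_draft: float, t_verify: float, hit_rate: float) -> float:
    """Ratio of sequential to overlapped expected round time."""
    return sequential_round_ms(t_draft, t_verify) / overlapped_round_ms(
        t_draft, t_verify, hit_rate)


# Illustrative numbers: 10 ms to draft, 30 ms to verify, 90% prediction accuracy.
ratio = speedup(10.0, 30.0, 0.9)
```

In this model the gain is bounded by the draft model's share of the round: with a perfect predictor the round shrinks from 40 ms to 30 ms, and a slower draft model leaves more latency to hide.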

Bottom Line

Treat inference speed as a core capability metric rather than an operational cost, as parallelized speculation techniques demonstrate that faster generation directly enables more powerful reasoning.
