Inference, Diffusion, World Models, and More | YC Paper Club

Business & Entrepreneurship | May 28, 2026 | 35.9 Thousand views | 1:07:19

TL;DR

At the inaugural YC Paper Club, Stanford researcher Tanishk presented Speculative Speculative Decoding (SSD), arguing that inference speed is becoming the primary constraint on AI capabilities rather than just a cost factor. The technique achieves 300 tokens per second on Llama 3 70B by parallelizing the drafting and verification steps of speculative decoding, effectively predicting verification outcomes to hide latency.

Inference as the Capability Bottleneck (2 insights)

Inference speed determines peak intelligence

As AI systems scale through test-time compute and reasoning (for example, models trained with RL), tokens per second becomes the hard ceiling on deliverable intelligence, not merely an operational cost.

RL compute exceeds pre-training

According to the talk, reinforcement learning already consumes more compute than pre-training. Because every RL step generates its rollouts through inference, generation efficiency is now on the critical path for further progress.

🔄 Speculative Speculative Decoding (SSD) (3 insights)

Parallelizing sequential dependencies

SSD eliminates the bottleneck where drafting must wait for verification by having the small draft model predict likely verification outcomes and begin drafting the next round while the large model verifies the current one.
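The overlap described above can be sketched with a toy simulation. This is a minimal illustration under stated assumptions, not the presenter's implementation: `target_next` and `draft_next` are hypothetical deterministic stand-ins for the large and small models, decoding is greedy, the draft model guesses only one outcome per round (full acceptance plus its own next token as the bonus), and the two "parallel" steps run one after the other here.

```python
from typing import Dict, List, Tuple

Token = int

def target_next(ctx: List[Token]) -> Token:
    """Hypothetical large model: deterministic greedy next token."""
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical small model: agrees with the target except at every 7th position."""
    t = target_next(ctx)
    return t if len(ctx) % 7 else (t + 1) % 50

def draft_tokens(ctx: List[Token], k: int) -> List[Token]:
    out: List[Token] = []
    for _ in range(k):
        out.append(draft_next(ctx + out))
    return out

def verify(ctx: List[Token], draft: List[Token]) -> Tuple[int, Token]:
    """Greedy verification: returns (number of accepted tokens, bonus token)."""
    for i, tok in enumerate(draft):
        expected = target_next(ctx + draft[:i])
        if tok != expected:
            return i, expected          # first mismatch: target's token is the bonus
    return len(draft), target_next(ctx + draft)

def target_generate(prompt: List[Token], n_new: int) -> List[Token]:
    """Reference output: plain greedy decoding with the target model only."""
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]

def ssd_generate(prompt: List[Token], n_new: int, k: int = 3):
    ctx = list(prompt)
    draft = draft_tokens(ctx, k)
    hits = misses = 0
    while len(ctx) - len(prompt) < n_new:
        # Speculate on the verification outcome and pre-draft the next round.
        # In a real system this would run concurrently with verify() below.
        guess_bonus = draft_next(ctx + draft)
        cache: Dict[Tuple[int, Token], List[Token]] = {
            (k, guess_bonus): draft_tokens(ctx + draft + [guess_bonus], k)
        }
        n_ok, bonus = verify(ctx, draft)
        ctx = ctx + draft[:n_ok] + [bonus]
        if (n_ok, bonus) in cache:
            draft = cache[(n_ok, bonus)]   # hit: next draft is already available
            hits += 1
        else:
            draft = draft_tokens(ctx, k)   # miss: draft again from the true context
            misses += 1
    return ctx[len(prompt):len(prompt) + n_new], hits, misses
```

Because every emitted token is either confirmed or produced by the target model, the output matches target-only greedy decoding regardless of how often the guess is right; the hit count only determines how much drafting time could be hidden.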

Predicting the verifier's behavior

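The algorithm predicts which tokens the large model will accept, including the bonus token (the extra token the large model contributes at the end of each verification round), with 80-90% accuracy by analyzing the draft model's token distributions. When the prediction is correct, the next draft is already available and drafting latency is hidden.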
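One hypothetical way to turn the draft model's token distributions into a prediction of the verification outcome is sketched below. It assumes, purely for illustration, that the probability the draft model assigns to each of its own tokens approximates the chance the large model accepts that token; the talk summary does not specify the actual estimator, and `outcome_probs` is an invented name.

```python
from typing import List

def outcome_probs(confidences: List[float]) -> List[float]:
    """Probability that exactly n draft tokens are accepted, for n = 0..k.

    confidences[i] is the assumed acceptance probability of draft token i
    (here: the draft model's own probability for the token it chose).
    """
    probs: List[float] = []
    survive = 1.0                       # probability that all earlier tokens were accepted
    for c in confidences:
        probs.append(survive * (1.0 - c))   # rejected exactly at this position
        survive *= c
    probs.append(survive)               # every draft token accepted
    return probs
```

For example, `outcome_probs([0.9, 0.8])` yields approximately `[0.1, 0.18, 0.72]`: full acceptance is by far the most likely outcome, so that is the branch most worth pre-drafting.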

Intelligent cache miss handling

Rather than naively falling back to standard speculation on mispredictions, SSD optimizes compute allocation across plausible prefix lengths to maximize hit rates and maintain speed.
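The allocation idea can be illustrated with a small sketch. It assumes a fixed budget of speculative branches per round and picks the (accepted length, bonus token) pairs with the highest estimated joint probability; `allocate_branches`, the budget, and the probabilities in the usage note are all hypothetical, and the actual SSD policy may differ.

```python
import heapq
from typing import Dict, List, Tuple

def allocate_branches(
    length_probs: List[float],
    bonus_probs: List[Dict[str, float]],
    budget: int,
) -> Tuple[List[Tuple[int, str]], float]:
    """Choose which verification outcomes to pre-draft for.

    length_probs[n]      : estimated probability that exactly n draft tokens are accepted
    bonus_probs[n][tok]  : estimated probability of bonus token `tok` given length n
    Returns the chosen (length, bonus) pairs and the expected cache hit rate.
    """
    candidates = [
        (lp * bp, n, tok)
        for n, lp in enumerate(length_probs)
        for tok, bp in bonus_probs[n].items()
    ]
    # Taking the `budget` most probable outcomes maximizes the expected hit rate.
    chosen = heapq.nlargest(budget, candidates)
    hit_rate = sum(p for p, _, _ in chosen)
    return [(n, tok) for _, n, tok in chosen], hit_rate
```

With `length_probs = [0.1, 0.18, 0.72]`, `bonus_probs = [{"a": 0.6, "b": 0.3}, {"c": 0.5, "d": 0.4}, {"e": 0.9, "f": 0.05}]`, and a budget of 3, the sketch selects `(2, "e")`, `(1, "c")`, and `(1, "d")` for an expected hit rate of about 0.81. Note that two of the three branches go to a shorter prefix rather than to the second candidate for full acceptance.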

📊 Performance Benchmarks (2 insights)

300 tokens per second on large models

The SSD implementation achieves approximately 300 tokens per second on Llama 3 70B using only four H100s, significantly outperforming existing open-source inference engines like SGLang.

Throughput and latency improvements

Unlike vanilla speculative decoding, which primarily reduces latency, SSD improves both latency and throughput by fully overlapping draft and target model computation.
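A back-of-the-envelope latency model shows why the overlap matters. The sketch below is an assumption-laden simplification, not a measurement: it treats each round as one draft step and one verify step, assumes a hit fully overlaps the two, and assumes a miss pays for an extra draft step. The function name and all timings are illustrative.

```python
from typing import Optional

def round_time_ms(t_draft: float, t_verify: float, hit_rate: Optional[float] = None) -> float:
    """Expected time per decoding round in milliseconds.

    hit_rate=None models vanilla speculative decoding (draft, then verify).
    Otherwise models overlapped drafting with the given cache hit rate.
    """
    if hit_rate is None:
        return t_draft + t_verify
    overlapped = max(t_draft, t_verify)   # draft and verify run side by side
    miss = overlapped + t_draft           # wrong guess: draft again afterwards
    return hit_rate * overlapped + (1.0 - hit_rate) * miss

# Illustrative numbers only: 4 ms draft, 10 ms verify, 85% hit rate.
sequential = round_time_ms(4.0, 10.0)
overlapped = round_time_ms(4.0, 10.0, hit_rate=0.85)
speedup = sequential / overlapped
```

Under these made-up timings the round drops from 14 ms to about 10.6 ms, roughly a 1.3x speedup per round, and the gain grows as the hit rate rises or as drafting becomes a larger share of the round.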

Bottom Line

Treat inference speed as a core capability metric rather than an operational cost, as parallelized speculation techniques demonstrate that faster generation directly enables more powerful reasoning.
