Fast Models Need Slow Developers — Sarah Chieng, Cerebras

AI Engineer

| Podcasts | May 22, 2026 | 10.3 Thousand views

TL;DR

As AI coding models like Codex Spark reach 1,200 tokens per second, roughly 20x faster than the 50 tokens per second developers are used to, developers must drop the habits they formed during the era of slow inference. This talk outlines a practical playbook for "slow development": use slower, smarter models for planning and fast models for execution, and treat the fast model as a real-time pair programmer that needs constant verification and strict context management.

⚡ The Infrastructure Behind the Speed Surge

Hardware breaks the memory wall

New architectures like Cerebras' wafer-scale engine use on-chip SRAM instead of off-chip HBM to eliminate memory bandwidth bottlenecks, while disaggregated inference separates compute-bound prefill from memory-bound decode onto specialized hardware.
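
A back-of-envelope sketch of why decode is memory-bound: to produce each token, the active weights have to be read from memory once, so single-stream decode speed is capped at roughly memory bandwidth divided by bytes read per token. The function name and the numbers in the usage note are illustrative assumptions, not Cerebras or GPU specifications, and the estimate ignores KV cache reads and batching.

```python
def max_decode_tokens_per_sec(bandwidth_bytes_per_sec: float,
                              active_params: float,
                              bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate.

    Hypothetical helper for illustration only: real systems also read the
    KV cache and amortize weight reads across a batch.
    """
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_sec / bytes_per_token
```

For example, an assumed 10 billion active parameters at 2 bytes each on an assumed 2 TB/s memory system gives `max_decode_tokens_per_sec(2e12, 10e9, 2)`, a ceiling of 100 tokens per second. Under this model, raising bandwidth by 10x raises the ceiling by 10x, which is the lever on-chip SRAM pulls.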

Stack-wide optimizations compound

Efficiency gains come simultaneously from model architectures like Mixture of Experts (MoE) and pruning, plus inference techniques like KV cache reuse that minimize redundant computations.

The 20x danger multiplier

Without changing habits developed for 50 tokens/sec models, developers will generate massive amounts of unverified technical debt 20 times faster, turning agent swarms into instant spaghetti code.

🎯 Orchestrating the Fast and the Slow

Strategic model pairing

Use larger, more intelligent models for complex planning and long-horizon workflows, then deploy fast models like Codex Spark as pure executors for sub-tasks to maximize both quality and speed.
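
The pairing can be sketched as a two-stage loop. This is a minimal illustration, not the speaker's implementation: `Model` is a hypothetical "prompt in, text out" callable standing in for whatever client you use, and the prompt wording is an assumption.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def plan_then_execute(task: str, planner: Model, executor: Model) -> list[str]:
    """Ask the slow model for a plan, then run each step on the fast model."""
    plan = planner(
        f"Break this task into short, independent steps, one per line:\n{task}"
    )
    # Keep non-empty lines only; each line is treated as one sub-task.
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    # The fast model sees one bounded step at a time, never the whole plan.
    return [executor(f"Task: {task}\nDo only this step: {step}") for step in steps]
```

Keeping the executor prompt limited to a single step is one way to treat the fast model as a pure executor: it has no room to re-plan.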

Codify success into skills

Capture successful AI trajectories as reusable "skills" using slower planning models, then have fast agents execute these verified patterns autonomously in the background.

Cherry-picking induces taste

Use the extreme speed to generate 15 to 75 variations of a UI or design element at once, then manually select the best. Picking by hand injects the "taste" that models lack, without exhaustive prompt engineering.
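
A minimal sketch of the fan-out step, assuming the model is exposed as a hypothetical `model(prompt) -> str` callable. The variation hint appended to each prompt and the default worker count are illustrative choices, not something the talk specifies.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def generate_variations(prompt: str, model: Callable[[str], str],
                        n: int = 15, max_workers: int = 8) -> list[str]:
    """Request n independent variations concurrently, returned in request order."""
    prompts = [
        f"{prompt}\n(Variation {i + 1} of {n}: take a distinct visual direction.)"
        for i in range(n)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the prompts were submitted.
        return list(pool.map(model, prompts))
```

The selection step stays manual on purpose: the function only produces candidates for a human to review side by side.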

🧠 Real-Time Collaboration & Context Discipline

Shift from batch to interactive

Treat fast models as real-time pair programmers; sit with the code, ask questions, and actively steer implementation rather than spawning agents and walking away.

Validation becomes free

At 1,200 tokens/sec, exhaustive validation (test suites, linting, diff reviews, and browser QA) is cheap enough to bake into every step instead of deferring it to pre-commit.
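
One way to bake validation into every step is to run all checks after each change and stop at the first failure. The sketch below assumes steps and checks are plain callables; `apply_with_checks` and the `Check` alias are hypothetical names for illustration.

```python
from typing import Callable

Check = Callable[[], bool]  # hypothetical: returns True when the check passes


def apply_with_checks(steps: list[Callable[[], None]],
                      checks: dict[str, Check]) -> tuple[int, list[str]]:
    """Run each step, then every check; stop when a step leaves a check failing.

    Returns (number of steps that passed all checks, names of failing checks).
    """
    for done, step in enumerate(steps):
        step()
        failed = [name for name, check in checks.items() if not check()]
        if failed:
            return done, failed
    return len(steps), []
```

In practice a check could wrap a command, for example `lambda: subprocess.run(["pytest", "-q"]).returncode == 0`, so a failing test suite halts the agent before the next change is layered on top.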

Externalize memory immediately

Context windows fill 20x faster, so compaction kicks in after about 30 seconds instead of 10 minutes. Break tasks into bounded goals and use persistent files (agents.md, plan.md, progress.md, verify.md) to maintain state across sessions.
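
A minimal sketch of externalized state, using progress.md as the example. The checkbox line format and both helper names are assumptions for illustration; the talk names the files but not their layout.

```python
from pathlib import Path


def record_progress(workdir: Path, goal: str, status: str) -> None:
    """Append one bounded goal to progress.md as a checked or unchecked item."""
    mark = "x" if status == "done" else " "
    with (workdir / "progress.md").open("a", encoding="utf-8") as f:
        f.write(f"- [{mark}] {goal}\n")


def pending_goals(workdir: Path) -> list[str]:
    """Return goals in progress.md that are not yet marked done."""
    path = workdir / "progress.md"
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    prefix = "- [ ] "
    return [line[len(prefix):] for line in lines if line.startswith(prefix)]
```

A fresh session can call `pending_goals` first and resume from the file, so nothing depends on what survived compaction in the previous context window.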

Bottom Line

Adopt a "slow developer" mindset by using fast models for execution only under tight human supervision and strict constraints, while externalizing context to prevent information loss in high-speed sessions.
