Fast Models Need Slow Developers — Sarah Chieng, Cerebras

Podcasts | May 22, 2026 | 2.86 thousand views

TL;DR

As AI coding models like Codex Spark reach 1,200 tokens per second, about 20x faster than current standards, developers must abandon bad habits formed during the era of slow inference. This talk outlines a practical playbook for "slow development": use slower, smarter models for planning and fast models for execution, and treat the AI as a real-time pair programmer whose output is constantly verified and whose context is strictly managed.

The Infrastructure Behind the Speed Surge

Hardware breaks the memory wall

New architectures like Cerebras' wafer-scale engine use on-chip SRAM instead of off-chip HBM to eliminate memory bandwidth bottlenecks, while disaggregated inference separates compute-bound prefill from memory-bound decode onto specialized hardware.

Stack-wide optimizations compound

Efficiency gains come simultaneously from model-level techniques such as Mixture of Experts (MoE) architectures and pruning, and from inference techniques such as KV cache reuse, which avoids recomputing work already done for a shared prompt prefix.
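To make the KV cache reuse idea concrete, here is a toy sketch in Python. It is not how any real inference engine is implemented: the `PrefixKVCache` class and the fake "KV entries" are hypothetical stand-ins, and the only point is to show that a second request sharing a prompt prefix needs fresh computation only for its new tokens.

```python
from typing import Dict, List, Tuple


class PrefixKVCache:
    """Toy stand-in for a KV cache: one fake entry per token prefix (hypothetical)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], str] = {}
        self.computed = 0  # positions that needed fresh computation

    def prefill(self, tokens: List[int]) -> List[str]:
        entries = []
        for i in range(1, len(tokens) + 1):
            prefix = tuple(tokens[:i])
            if prefix not in self._store:
                # Placeholder for the real attention key/value computation.
                self._store[prefix] = f"kv{i}"
                self.computed += 1
            entries.append(self._store[prefix])
        return entries


cache = PrefixKVCache()
system_prompt = [1, 2, 3, 4]  # shared prefix, e.g. a system prompt

cache.prefill(system_prompt + [10, 11])  # first request: every position is new
first_request_cost = cache.computed

cache.prefill(system_prompt + [20])      # second request reuses the 4-token prefix
second_request_cost = cache.computed - first_request_cost

print(first_request_cost, second_request_cost)
```

In this sketch the first request computes all six positions, while the second computes only the one token that differs from the cached prefix.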

The 20x danger multiplier

Without changing habits developed for 50 tokens/sec models, developers will generate massive amounts of unverified technical debt 20 times faster, turning agent swarms into instant spaghetti code.

🎯 Orchestrating the Fast and the Slow

Strategic model pairing

Use larger, more intelligent models for complex planning and long-horizon workflows, then deploy fast models like Codex Spark as pure executors for sub-tasks to maximize both quality and speed.
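The pairing described above can be sketched as a small planner/executor loop. This is a minimal illustration in Python under stated assumptions: `orchestrate`, `stub_planner`, and `stub_executor` are hypothetical names, and the stubs stand in for calls to a slower planning model and a fast executor model, neither of whose APIs is described in the talk summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    step: str
    output: str


def orchestrate(
    task: str,
    planner: Callable[[str], List[str]],
    executor: Callable[[str], str],
) -> List[StepResult]:
    """Ask the slow planner for sub-tasks, then hand each one to the fast executor."""
    steps = planner(task)
    return [StepResult(step, executor(step)) for step in steps]


def stub_planner(task: str) -> List[str]:
    # Stand-in for a larger, slower model that breaks the task into bounded steps.
    return [f"{task}: write tests", f"{task}: implement", f"{task}: refactor"]


def stub_executor(step: str) -> str:
    # Stand-in for a fast model that only carries out a single sub-task.
    return f"done({step})"


results = orchestrate("add login", stub_planner, stub_executor)
for r in results:
    print(r.step, "->", r.output)
```

Keeping the planner and executor as plain callables means either side can be swapped for a different model without touching the orchestration logic.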

Codify success into skills

Capture successful AI trajectories as reusable "skills" using slower planning models, then have fast agents execute these verified patterns autonomously in the background.

Cherry-picking induces taste

Use the extreme speed to generate 15 to 75 variations of a UI or design element simultaneously, then manually select the best one. Human selection supplies the "taste" that models lack, without exhaustive prompt engineering.
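One way to picture the generate-many-then-pick workflow is the Python sketch below. It assumes a hypothetical `generate(prompt, seed)` function standing in for a fast model call, and it fans the requests out with a thread pool. The final selection is deliberately left to a person: the index used here is only a placeholder for that manual choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_variants(
    prompt: str,
    n: int,
    generate: Callable[[str, int], str],
) -> List[str]:
    """Request n variants concurrently; results come back in seed order."""
    with ThreadPoolExecutor(max_workers=min(n, 16)) as pool:
        return list(pool.map(lambda seed: generate(prompt, seed), range(n)))


def stub_generate(prompt: str, seed: int) -> str:
    # Stand-in for a fast model call; a real call would return markup or a design.
    return f"{prompt} [variant {seed}]"


variants = generate_variants("pricing page hero", 15, stub_generate)

# In practice a human reviews all variants side by side and picks one.
chosen = variants[3]
print(len(variants), chosen)
```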

🧠 Real-Time Collaboration & Context Discipline

Shift from batch to interactive

Treat fast models as real-time pair programmers; sit with the code, ask questions, and actively steer implementation rather than spawning agents and walking away.

Validation becomes free

At 1,200 tokens/sec, exhaustive validation—test suites, linting, diff reviews, and browser QA—should be baked into every step instead of deferred to pre-commit.
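A rough sketch of validation baked into every step might look like the following Python. The names `run_checks` and `apply_steps` are hypothetical, and the checks are simple callables standing in for a real test suite, linter, diff review, or browser QA run. The design choice illustrated is that checks run after each step, so a failure is caught at the step that caused it rather than at commit time.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]
Step = Tuple[str, Callable[[], None]]


def run_checks(checks: Dict[str, Check]) -> List[str]:
    """Return the names of all checks that fail."""
    return [name for name, check in checks.items() if not check()]


def apply_steps(steps: List[Step], checks: Dict[str, Check]) -> Tuple[List[str], List[str]]:
    """Apply steps one at a time, validating after each; stop at the first failure."""
    applied: List[str] = []
    for name, step in steps:
        step()
        failures = run_checks(checks)
        if failures:
            return applied, failures
        applied.append(name)
    return applied, []


# Toy project state; real checks would shell out to a test runner or linter.
state = {"tests_pass": True, "lint_clean": True}
checks = {
    "tests": lambda: state["tests_pass"],
    "lint": lambda: state["lint_clean"],
}
steps = [
    ("add endpoint", lambda: None),
    ("rename helper", lambda: state.update(tests_pass=False)),  # introduces a regression
    ("update docs", lambda: None),
]

applied, failures = apply_steps(steps, checks)
print(applied, failures)
```

Here the loop stops at "rename helper", so the regression is attributed to a single small step and "update docs" never runs on top of broken code.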

Externalize memory immediately

With context windows filling 20x faster (compaction in 30 seconds vs. 10 minutes), break tasks into bounded goals and use persistent files (agents.md, plan.md, progress.md, verify.md) to maintain state across sessions.
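The 30 seconds versus 10 minutes comparison follows from simple arithmetic, sketched below in Python together with a minimal way to persist state to progress.md. The token budget is an assumption derived from the summary's own figures (1,200 tokens/sec for 30 seconds), the 60 tokens/sec baseline is what a 20x ratio implies, and `record_progress` and its line format are hypothetical, not a format prescribed by the talk.

```python
import tempfile
from pathlib import Path


def seconds_to_fill(budget_tokens: int, tokens_per_sec: float) -> float:
    """Time until a token budget is exhausted at a given generation speed."""
    return budget_tokens / tokens_per_sec


# Assumed budget before compaction: 1,200 tokens/sec for 30 seconds.
BUDGET = 1_200 * 30
fast_seconds = seconds_to_fill(BUDGET, 1_200)  # fast model
slow_seconds = seconds_to_fill(BUDGET, 60)     # 20x slower baseline


def record_progress(workdir: Path, goal: str, status: str) -> Path:
    """Append one bounded goal and its status to progress.md so it survives compaction."""
    path = workdir / "progress.md"
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- [{status}] {goal}\n")
    return path


workdir = Path(tempfile.mkdtemp())  # throwaway directory for the example
record_progress(workdir, "parse config", "done")
progress_path = record_progress(workdir, "add retry logic", "in progress")
progress_lines = progress_path.read_text(encoding="utf-8").splitlines()

print(fast_seconds, slow_seconds, progress_lines)
```

Because the file is appended to rather than rewritten, a fresh session can read it back and resume from the last recorded goal.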

Bottom Line

Adopt a "slow developer" mindset by using fast models for execution only under tight human supervision and strict constraints, while externalizing context to prevent information loss in high-speed sessions.
