Chip design from the bottom up – Reiner Pope

| Podcasts | May 22, 2026 | 25.8 Thousand views | 1:20:20

TL;DR

Reiner Pope explains how AI chips work from fundamental logic gates up, showing that the physical cost of moving data between memory and compute units (via multiplexers) often exceeds the cost of the arithmetic itself, and that multiplier circuit area scales quadratically with precision, making low-precision arithmetic more efficient than linear scaling would suggest.

🔢 The Computational Primitive 2 insights

Multiply-accumulate drives AI workloads

Matrix multiplication, the core of AI inference and training, decomposes into repeated multiply-accumulate (MAC) operations where pairs of numbers are multiplied and added to a running sum.
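As a rough illustration of that decomposition, the sketch below multiplies two small matrices using nothing but MAC steps and counts them. The function name `matmul_mac` and the list-of-lists layout are assumptions made for this example, not anything taken from the episode.

```python
def matmul_mac(a, b):
    """Multiply two matrices (lists of lists) using only multiply-accumulate steps."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    mac_count = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0  # running sum for one output element
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # one MAC: multiply a pair, add to the sum
                mac_count += 1
            out[i][j] = acc
    return out, mac_count


result, macs = matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, macs)
```

A 2×2 by 2×2 product needs 2·2·2 = 8 MACs here; in general the count is rows × inner × cols, which is why the MAC unit is the part of the chip worth optimizing.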

Asymmetric precision prevents error accumulation

AI chips use lower precision for multiplication (e.g., 4-bit) than for accumulation (e.g., 8-bit) because rounding errors compound across the many additions of a running sum, whereas each multiplication introduces its error only once.
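The effect can be sketched with a toy model of limited precision. The helper `round_to_sig_bits` below keeps a fixed number of significant binary digits; it is a hypothetical stand-in chosen for illustration, not a model of FP4, FP8, or any real hardware format.

```python
import math


def round_to_sig_bits(x, bits):
    """Round x to `bits` significant binary digits (toy model, not a real format)."""
    if x == 0:
        return 0.0
    _, exp = math.frexp(x)          # x = m * 2**exp with 0.5 <= |m| < 1
    scale = 2.0 ** (bits - exp)
    return round(x * scale) / scale


def dot(xs, ys, mul_bits, acc_bits):
    """Dot product with inputs rounded to mul_bits and the running sum to acc_bits."""
    acc = 0.0
    for x, y in zip(xs, ys):
        prod = round_to_sig_bits(x, mul_bits) * round_to_sig_bits(y, mul_bits)
        acc = round_to_sig_bits(acc + prod, acc_bits)  # rounding repeats every step
    return acc


ones = [1.0] * 64
print(dot(ones, ones, mul_bits=4, acc_bits=4))
print(dot(ones, ones, mul_bits=4, acc_bits=8))
```

With a 4-bit accumulator the running sum stalls at 16, because adding 1 to 16 rounds back to 16 on every later step, while the 8-bit accumulator reaches the exact answer of 64 using the same 4-bit inputs.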

Physical Circuit Implementation 3 insights

AND gates generate partial products

A p-bit by q-bit multiplication requires p×q AND gates to produce all partial products by pairwise combining bits from both inputs.
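A minimal sketch of that grid, assuming unsigned inputs: each AND-gate output is tagged with the weight of its column, and summing the weighted bits recovers the product. The name `partial_products` is hypothetical.

```python
def partial_products(a, b, p, q):
    """Return one (weight, bit) pair per AND gate of an unsigned p-bit by q-bit multiply."""
    bits = []
    for i in range(p):
        for j in range(q):
            a_i = (a >> i) & 1
            b_j = (b >> j) & 1
            bits.append((i + j, a_i & b_j))  # one AND gate, weight 2**(i + j)
    return bits


pp = partial_products(13, 11, 4, 4)
print(len(pp))                             # number of AND gates: p * q
print(sum(bit << w for w, bit in pp))      # weighted sum equals 13 * 11
```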

Full adders compress bits efficiently

The Dadda multiplier algorithm uses full adders (3:2 compressors that sum three input bits of equal weight into a sum bit and a carry bit) to reduce the partial product grid, requiring on the order of p×q full adders in total.
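The sketch below shows a full adder acting as a 3:2 compressor and uses it to collapse the partial product grid column by column. It is a simplified sequential compressor, not Dadda's actual staged schedule, and the function names are assumptions. Each full adder removes exactly one bit from the grid, which is why the adder count tracks the number of partial product bits; without an accumulator input it comes out somewhat below p×q.

```python
def full_adder(a, b, c):
    """3:2 compressor: three bits of equal weight in, (sum, carry) out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def compress_multiply(a, b, p, q):
    """Multiply unsigned a (p bits) by b (q bits) by compressing partial products."""
    # columns[w] holds the bits of weight 2**w still waiting to be summed
    columns = [[] for _ in range(p + q)]
    for i in range(p):
        for j in range(q):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    full_adders = half_adders = 0
    w = 0
    while w < len(columns):
        col = columns[w]
        while len(col) > 1:
            if w + 1 == len(columns):
                columns.append([])
            if len(col) >= 3:
                s, c = full_adder(col.pop(), col.pop(), col.pop())
                full_adders += 1
            else:
                # two bits left: a full adder with a constant 0 acts as a half adder
                s, c = full_adder(col.pop(), col.pop(), 0)
                half_adders += 1
            col.append(s)               # sum bit stays in this column
            columns[w + 1].append(c)    # carry bit moves one column up
        w += 1
    value = sum(col[0] << k for k, col in enumerate(columns) if col)
    return value, full_adders, half_adders


print(compress_multiply(13, 11, 4, 4))
```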

Circuit area scales quadratically

Because both AND gates and full adders scale as p×q with bit width, halving precision from 8-bit to 4-bit theoretically yields 4x circuit efficiency, not merely 2x.
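As a worked version of that arithmetic, write A(p, q) for multiplier area and assume it is proportional to p×q (the symbol A is introduced here for illustration only):

```latex
\[
\frac{A(8,8)}{A(4,4)} \approx \frac{8 \times 8}{4 \times 4} = \frac{64}{16} = 4
\]
```

Halving both operand widths therefore shrinks the estimated multiplier area by a factor of four, so roughly four 4-bit multipliers fit in the area of one 8-bit multiplier.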

💰 The Hidden Cost of Data Movement 2 insights

Multiplexers dominate hardware costs

Reading from a register file requires multiplexers that cost approximately 3×n×p gates (for n registers of p bits), which for typical configurations exceeds the p×q cost of the actual multiply-accumulate unit.
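One plausible way to arrive at a figure near 3×n×p, offered here as an assumption rather than the derivation used in the episode, is to build the read port as a binary tree of 2:1 multiplexers, each costing two AND gates and one OR gate per bit. The sketch below selects one of n registers this way and counts gates; the names `mux2` and `read_register` are hypothetical.

```python
def mux2(a, b, sel):
    """2:1 mux on single bits: two AND gates plus one OR gate (sel inverter not counted)."""
    return (a & (1 - sel)) | (b & sel)


def read_register(regs, index, p):
    """Read regs[index] through a tree of 2:1 muxes; len(regs) must be a power of two."""
    level = list(regs)
    gates = 0
    bit = 0
    while len(level) > 1:
        sel = (index >> bit) & 1        # one address bit steers each tree level
        nxt = []
        for k in range(0, len(level), 2):
            word = 0
            for t in range(p):
                lo = (level[k] >> t) & 1
                hi = (level[k + 1] >> t) & 1
                word |= mux2(lo, hi, sel) << t
                gates += 3              # 2 AND + 1 OR per bit per 2:1 mux
            nxt.append(word)
        level = nxt
        bit += 1
    return level[0], gates


regs = [3, 7, 1, 15, 0, 9, 12, 5]       # eight 4-bit registers
print(read_register(regs, 5, 4))
```

Under these assumptions the tree uses n - 1 two-input muxes per bit, so the count is 3×(n - 1)×p, which approaches 3×n×p as n grows.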

Data movement exceeds computation expense

In the example discussed, moving data from an 8-entry register file to the ALU requires 24p gates versus only 4p gates for the 4-bit multiplication itself, making data movement 6x more expensive than the math.
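The 6x figure follows directly from the two gate estimates. Writing C_mux for the register-read cost and C_mul for the multiplier cost (both symbols introduced here for illustration), with n = 8 registers and q = 4:

```latex
\[
\frac{C_{\text{mux}}}{C_{\text{mul}}} = \frac{3np}{pq} = \frac{3n}{q} = \frac{3 \times 8}{4} = 6
\]
```

The operand width p cancels, so under these estimates the ratio depends only on the number of registers and the width of the other multiplier input.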

🎯 Precision Tradeoffs in Modern Chips 2 insights

FP4 and FP8 require dedicated circuits

Unlike software, hardware cannot freely repurpose a circuit built for one precision to run another; designers must allocate fixed die area to FP4 and FP8 circuits based on expected workloads.

Nvidia acknowledges quadratic scaling

While historical GPUs showed 2x speedup when halving precision (linear scaling), Nvidia's B300 and newer chips report 3x speedup for FP4 vs FP8, moving toward the theoretical 4x efficiency dictated by quadratic area scaling.

Bottom Line

In AI chip design, optimizing data movement between memory and compute units delivers greater efficiency gains than optimizing the mathematical operations themselves, and reducing numerical precision shrinks arithmetic circuit area quadratically, making aggressive quantization strategies far more valuable than linear scaling suggests.
