Chip design from the bottom up – Reiner Pope
TL;DR
Reiner Pope explains how AI chips work from fundamental logic gates up. Two results stand out: the physical cost of moving data between memory and compute units (via multiplexers) often exceeds the cost of the actual mathematical operations, and circuit area scales quadratically with precision, so low-precision arithmetic is considerably more efficient than a linear model would suggest.
🔢 The Computational Primitive
Multiply-accumulate drives AI workloads
Matrix multiplication, the core of AI inference and training, decomposes into repeated multiply-accumulate (MAC) operations where pairs of numbers are multiplied and added to a running sum.
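As a minimal sketch of that decomposition, the plain-Python function below multiplies two small matrices using nothing but explicit multiply-accumulate steps and counts them. The names `matmul_mac` and `mac_count` are illustrative, not from the talk.

```python
def matmul_mac(a, b):
    """Multiply a (m x k) by b (k x n) using explicit multiply-accumulate steps."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    mac_count = 0
    for i in range(m):
        for j in range(n):
            acc = 0  # the running sum for one output element
            for t in range(k):
                acc += a[i][t] * b[t][j]  # one MAC: multiply a pair, add to the sum
                mac_count += 1
            out[i][j] = acc
    return out, mac_count


result, macs = matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, macs)
```

An m×k by k×n product takes m×n×k MACs, which is why the cost of a single MAC unit matters so much.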
Asymmetric precision prevents error accumulation
AI chips use lower precision for multiplication (e.g., 4-bit) than for accumulation (e.g., 8-bit): a single multiplication introduces at most one rounding error, whereas rounding errors compound across the many additions of a long summation.
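The effect can be illustrated with a toy fixed-point accumulator. This is only a sketch: the `quantize` helper and the choice of 4 versus 16 fractional bits are assumptions made for illustration and are not the floating-point formats real chips use.

```python
def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (toy fixed-point format)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def accumulate(products, frac_bits):
    """Add products one at a time, rounding the running sum after every addition."""
    acc = 0.0
    for p in products:
        acc = quantize(acc + p, frac_bits)
    return acc


products = [0.03] * 1000           # hypothetical stream of small products
exact = sum(products)              # reference sum in double precision
narrow = accumulate(products, 4)   # coarse accumulator: step size 1/16
wide = accumulate(products, 16)    # finer accumulator: step size 1/65536

print(abs(narrow - exact), abs(wide - exact))
```

In the coarse accumulator every individual addition is rounded away, so the error grows with the length of the sum, while the finer accumulator stays close to the reference.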
⚡ Physical Circuit Implementation
AND gates generate partial products
A p-bit by q-bit multiplication requires p×q AND gates to produce all partial products by pairwise combining bits from both inputs.
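A short sketch of the partial-product grid, with one bitwise AND standing in for each gate. The function name `partial_products` and the example operands are illustrative choices.

```python
def partial_products(x, y, p, q):
    """Return (weight, bit) for each of the p*q partial products of x times y."""
    bits = []
    for i in range(p):
        for j in range(q):
            bit = ((x >> i) & 1) & ((y >> j) & 1)  # one AND gate
            bits.append((i + j, bit))              # this bit carries weight 2**(i+j)
    return bits


pp = partial_products(13, 11, 4, 4)
product = sum(bit << weight for weight, bit in pp)
print(len(pp), product)
```

Summing the weighted bits recovers the product; the rest of the multiplier exists to do that summation in hardware.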
Full adders compress bits efficiently
The Dadda multiplier algorithm uses full adders (3:2 compressors that sum three input bits of equal weight into a two-bit result: a sum bit and a carry bit) to reduce the partial product grid, requiring roughly p×q full adders in total.
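The sketch below shows the compression idea with a naive column-by-column schedule. It is not the Dadda schedule itself, which orders the reductions to minimize the number of stages; a two-bit column is handled here by feeding a constant 0 into the third input, and the names are illustrative.

```python
def full_adder(a, b, c):
    """3:2 compressor: three bits of equal weight in, (sum bit, carry bit) out."""
    total = a + b + c
    return total & 1, total >> 1


def multiply_by_compression(x, y, p, q):
    """Multiply by reducing each column of partial products to a single bit."""
    columns = [[] for _ in range(p + q + 1)]  # columns[w] holds bits of weight 2**w
    for i in range(p):
        for j in range(q):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    adders = 0
    for w in range(p + q):
        col = columns[w]
        while len(col) > 1:
            a = col.pop()
            b = col.pop()
            c = col.pop() if col else 0  # pad with 0 when only two bits remain
            s, carry = full_adder(a, b, c)
            col.append(s)                   # sum bit stays in this column
            columns[w + 1].append(carry)    # carry bit moves one column up
            adders += 1
    product = sum(col[0] << w for w, col in enumerate(columns) if col)
    return product, adders


product, adders = multiply_by_compression(13, 11, 4, 4)
print(product, adders)
```

Each adder with three live inputs removes one bit from the grid, which is why the adder count tracks the number of partial products.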
Circuit area scales quadratically
Because the counts of both AND gates and full adders scale as p×q, halving both operand widths from 8-bit to 4-bit shrinks the multiplier from 8×8 = 64 cells to 4×4 = 16, a theoretical 4x area saving rather than merely 2x.
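The arithmetic can be checked directly under the p×q cost model from this section. The helper names are illustrative, and the model ignores wiring and everything outside the multiplier.

```python
def multiplier_cells(p, q):
    """Estimated cell counts for a p-bit by q-bit multiplier under the p*q model."""
    return {"and_gates": p * q, "full_adders": p * q}


def area_ratio(bits_a, bits_b):
    """Ratio of total cell counts for square multipliers of two bit widths."""
    a = sum(multiplier_cells(bits_a, bits_a).values())
    b = sum(multiplier_cells(bits_b, bits_b).values())
    return a / b


ratio = area_ratio(8, 4)
print(ratio)
```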
💰 The Hidden Cost of Data Movement
Multiplexers dominate hardware costs
Reading from a register file requires multiplexers that cost approximately 3×n×p gates (for n registers of p bits), which for typical configurations exceeds the p×q cost of the actual multiply-accumulate unit.
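One way to arrive at a figure of this size, offered as an assumption rather than the speaker's derivation, is to build the read port as a binary tree of 2:1 multiplexers, each costing two ANDs and one OR per bit. That gives 3×(n−1)×p gates, close to 3×n×p. The register contents below are arbitrary, and n is assumed to be a power of two.

```python
def mux2(a, b, sel):
    """2:1 mux on single bits: two ANDs and one OR (select inverter not counted)."""
    return (a & (1 - sel)) | (b & sel)


def read_register(registers, index, p):
    """Select registers[index] through a tree of 2:1 muxes; return (value, gates)."""
    level = list(registers)  # length assumed to be a power of two
    gates = 0
    bit = 0
    while len(level) > 1:
        sel = (index >> bit) & 1  # one select line per tree level
        nxt = []
        for k in range(0, len(level), 2):
            word = 0
            for b in range(p):
                lo = (level[k] >> b) & 1
                hi = (level[k + 1] >> b) & 1
                word |= mux2(lo, hi, sel) << b
                gates += 3  # two ANDs plus one OR for this bit
            nxt.append(word)
        level = nxt
        bit += 1
    return level[0], gates


value, gates = read_register([3, 7, 1, 15, 0, 9, 4, 12], 5, 4)
print(value, gates)
```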
Data movement exceeds computation expense
In the example discussed, moving data from an 8-entry register file to the ALU requires 3×8×p = 24p gates, versus only p×4 = 4p gates for multiplying by a 4-bit operand, making data movement 6x more expensive than the math.
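Plugging the example's numbers into the two cost formulas from this section shows that the 6x ratio does not depend on p. The function names are illustrative.

```python
def mux_read_gates(n, p):
    """Approximate gate count to read one p-bit value from n registers."""
    return 3 * n * p


def multiply_gates(p, q):
    """Approximate cell count of a p-bit by q-bit multiplier."""
    return p * q


# 8-entry register file, 4-bit second operand, several widths for p
ratios = [mux_read_gates(8, p) / multiply_gates(p, 4) for p in (4, 8, 16)]
print(ratios)
```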
🎯 Precision Tradeoffs in Modern Chips
FP4 and FP8 require dedicated circuits
Unlike software, hardware cannot freely repurpose a circuit built for one precision to serve another; designers must allocate fixed die area to FP4 and FP8 circuits based on expected workloads.
Nvidia acknowledges quadratic scaling
While historical GPUs showed 2x speedup when halving precision (linear scaling), Nvidia's B300 and newer chips report 3x speedup for FP4 vs FP8, moving toward the theoretical 4x efficiency dictated by quadratic area scaling.
Bottom Line
In AI chip design, optimizing data movement between memory and compute units delivers greater efficiency gains than optimizing the mathematical operations themselves. Reducing numerical precision also shrinks the arithmetic circuits quadratically, which makes aggressive quantization far more valuable than linear scaling suggests.