Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 6: Kernels, Triton, XLA
This lecture bridges the gap between GPU programming abstractions and hardware realities, explaining how thread hierarchies, memory systems, and hardware constraints like warps and bank conflicts determine kernel performance for deep learning workloads.