Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs
TL;DR
Angelos Perivolaropoulos from ElevenLabs demonstrates how to train a GPT-2 style language model from scratch using only PyTorch and minimal dependencies, arguing that modern LLM development depends roughly 80% on training methodology and optimization rather than on architectural novelty.
🛠️ Workshop Logistics
Consumer Hardware Training
The model trains on laptops with just 16GB RAM or free Google Colab GPUs, requiring no specialized infrastructure.
Minimal Dependencies
Participants use uv for Python environment management and pure PyTorch, without high-level frameworks such as Hugging Face Transformers.
🏗️ Tokenization & Architecture
Character-Level Efficiency
Using only 65 unique characters instead of 50,000+ subword tokens keeps the bigram space to 4,225 (65²) combinations, small enough for the Shakespeare dataset to cover.
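A character-level tokenizer and a bigram coverage check can be sketched in a few lines of plain Python. This is a minimal illustration, not the workshop's code: `sample_text` is a short stand-in for the Shakespeare corpus, and the helper names `encode`, `decode`, and `bigram_coverage` are hypothetical.

```python
# Stand-in for the Shakespeare corpus (assumption: the real dataset is not reproduced here).
sample_text = "To be, or not to be, that is the question."

# Vocabulary: every distinct character, sorted so ids are stable across runs.
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(text):
    """Map each character to its integer id."""
    return [stoi[ch] for ch in text]


def decode(ids):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)


def bigram_coverage(ids, vocab_size):
    """Return (distinct adjacent pairs observed, total possible pairs)."""
    seen = set(zip(ids, ids[1:]))
    return len(seen), vocab_size ** 2


ids = encode(sample_text)
seen, possible = bigram_coverage(ids, len(chars))
print(f"vocab={len(chars)} bigrams seen={seen} of {possible}")
```

Run on the full corpus instead of the sample string, the same function reports how many of the possible bigrams actually occur in the training data.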
GPT-2 Building Blocks
The transformer architecture consists of multi-head self-attention, MLP feed-forward layers, residual connections for gradient stability, and layer normalization to prevent exploding activations.
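The four building blocks above fit into one small PyTorch module. The following is a sketch under stated assumptions rather than the workshop's implementation: the class name `Block`, the head count of 6, the 4x MLP expansion, and the omission of dropout are illustrative choices, and the pre-norm layout follows the GPT-2 convention of normalizing before each sub-layer.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One pre-norm GPT-2 style block (hypothetical name and defaults)."""

    def __init__(self, n_embd=384, n_head=6):
        super().__init__()
        assert n_embd % n_head == 0, "embedding size must divide evenly across heads"
        self.n_head = n_head
        self.ln1 = nn.LayerNorm(n_embd)           # normalizes activations before attention
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # joint query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)     # output projection after merging heads
        self.ln2 = nn.LayerNorm(n_embd)           # normalizes activations before the MLP
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def attention(self, x):
        B, T, C = x.shape
        head_dim = C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape each tensor to (B, n_head, T, head_dim).
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

    def forward(self, x):
        x = x + self.attention(self.ln1(x))  # residual around attention
        x = x + self.mlp(self.ln2(x))        # residual around the MLP
        return x


block = Block().eval()
x = torch.randn(2, 8, 384)  # (batch, sequence length, embedding size)
with torch.no_grad():
    y = block(x)
print(y.shape)  # same shape as the input: (2, 8, 384)
```

Because every sub-layer maps a `(batch, time, n_embd)` tensor to the same shape, blocks like this can be stacked to any depth between the token embedding and the output head.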
Embedding Size Trade-offs
With 384-dimensional embeddings, a 65-token vocabulary needs about 25,000 embedding parameters (65 × 384), whereas GPT-2's roughly 50,000-token vocabulary would need about 19 million, three times the model size.
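The embedding figures follow from multiplying vocabulary size by embedding width. A short check, assuming a single token-embedding table with no bias (the function name `embedding_params` is hypothetical):

```python
def embedding_params(vocab_size, n_embd):
    """Parameters in one token-embedding table of shape (vocab_size, n_embd)."""
    return vocab_size * n_embd


n_embd = 384
char_level = embedding_params(65, n_embd)        # character-level vocabulary
gpt2_scale = embedding_params(50_000, n_embd)    # round figure used in the summary

print(f"65 tokens:     {char_level:,} parameters")
print(f"50,000 tokens: {gpt2_scale:,} parameters")
```

GPT-2's actual vocabulary is 50,257 tokens, which gives a slightly larger table than the rounded 50,000 used here.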
⚙️ Training Philosophy
Training Loop Dominates Performance
The difference between GPT-4 and GPT-5 or Gemini 3 and 3.1 stems primarily from post-training and fine-tuning strategies, not base architecture changes.
Bigram Coverage Mathematics
A 200,000-token vocabulary has 40 billion possible bigrams (200,000²), so a corpus needs at least tens of billions of tokens, and far more in practice, to see each one even once, while a 65-token vocabulary has only 4,225 combinations.
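The bigram counts come from squaring the vocabulary size. The sketch below also reports a lower bound on corpus length: a sequence of n tokens contains n - 1 adjacent pairs, so seeing every bigram at least once takes at least V² + 1 tokens. Real text falls far short of that bound because token frequencies are heavily skewed. The function names are illustrative.

```python
def possible_bigrams(vocab_size):
    """Number of ordered token pairs for a given vocabulary size."""
    return vocab_size ** 2


def min_tokens_for_full_coverage(vocab_size):
    """Shortest possible sequence containing every bigram at least once."""
    return possible_bigrams(vocab_size) + 1


for name, v in [("character-level", 65), ("200K vocabulary", 200_000)]:
    print(
        f"{name}: {possible_bigrams(v):,} bigrams, "
        f"at least {min_tokens_for_full_coverage(v):,} tokens to see them all"
    )
```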
Bottom Line
Focus on mastering the training loop and data strategy rather than architectural complexity: training methodology drives most LLM performance improvements, and a small model like this one can be trained on consumer hardware.