Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs

AI Engineer

Podcasts | May 04, 2026 | 2.64 Thousand views | 1:21:26

TL;DR

Angelos Perivolaropoulos from ElevenLabs demonstrates how to train a GPT-2 style language model from scratch using only PyTorch and minimal dependencies, arguing that roughly 80% of modern LLM progress comes from training methodology and optimization rather than architectural novelty.

🛠️ Workshop Logistics 2 insights

Consumer Hardware Training

The model trains on laptops with as little as 16 GB of RAM or on free Google Colab GPUs, with no specialized infrastructure required.

Minimal Dependencies

Participants use uv for Python environment management and plain PyTorch, without high-level frameworks such as Hugging Face Transformers.

🏗️ Tokenization & Architecture 3 insights

Character-Level Efficiency

With only 65 unique characters instead of 50,000+ subword tokens, there are just 4,225 possible bigrams (65 × 65), few enough for the Shakespeare dataset to cover them all.
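As a rough illustration of the idea, the sketch below builds a character-level tokenizer and measures what fraction of all possible bigrams a text actually contains. The names `sample_text`, `encode`, `decode`, and `bigram_coverage` are hypothetical and not taken from the workshop code, and the one-line sample stands in for the Shakespeare dataset.

```python
# Sketch of a character-level tokenizer plus a bigram coverage check.
# `sample_text` is a stand-in; the workshop uses the Shakespeare dataset.
sample_text = "To be, or not to be, that is the question."

# Vocabulary: every distinct character, sorted so ids are stable across runs.
vocab = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(text):
    """Map each character to its integer id."""
    return [stoi[ch] for ch in text]


def decode(ids):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)


def bigram_coverage(ids, vocab_size):
    """Fraction of all vocab_size**2 ordered pairs that occur adjacently in ids."""
    seen = set(zip(ids, ids[1:]))
    return len(seen) / (vocab_size ** 2)


ids = encode(sample_text)
coverage = bigram_coverage(ids, len(vocab))
print(f"vocab size: {len(vocab)}, bigram coverage: {coverage:.1%}")
```

On a single sentence the coverage is low; the point of a small vocabulary is that coverage climbs quickly as the corpus grows.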

GPT-2 Building Blocks

The transformer architecture consists of multi-head self-attention, MLP feed-forward layers, residual connections for gradient stability, and layer normalization to prevent exploding activations.
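A minimal sketch of how these four pieces could fit together in one pre-norm block, written in plain PyTorch. This is an assumption about the layout, not the workshop's own code: the class name `Block`, the head count of 6, and the 4x MLP expansion are illustrative choices, and `F.scaled_dot_product_attention` requires PyTorch 2.0 or later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One pre-norm transformer block: attention and MLP, each inside a residual."""

    def __init__(self, n_embd=384, n_head=6):  # n_head=6 is an assumed value
        super().__init__()
        assert n_embd % n_head == 0, "embedding size must divide evenly across heads"
        self.n_head = n_head
        self.ln1 = nn.LayerNorm(n_embd)
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # joint query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)     # output projection after attention
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def attention(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        head_dim = C // self.n_head
        # Reshape to (B, n_head, T, head_dim) so each head attends independently.
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        # Causal mask: each position may only attend to itself and earlier positions.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # merge heads back together
        return self.proj(y)

    def forward(self, x):
        x = x + self.attention(self.ln1(x))  # residual connection around attention
        x = x + self.mlp(self.ln2(x))        # residual connection around the MLP
        return x


block = Block()
x = torch.randn(2, 8, 384)  # (batch, sequence length, embedding size)
out = block(x)
print(out.shape)
```

The block preserves the input shape, which is what allows several of them to be stacked one after another.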

Embedding Size Trade-offs

With 384-dimensional embeddings, a 65-token vocabulary needs about 25,000 embedding parameters (65 × 384), whereas GPT-2's roughly 50,000-token vocabulary would need about 19 million (50,000 × 384), which the talk puts at three times the model size.
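The embedding figures follow from multiplying vocabulary size by embedding width, which can be checked with a few lines. The helper name `embedding_params` is hypothetical, and the calculation uses the rounded 50,000-token figure quoted here.

```python
def embedding_params(vocab_size, n_embd):
    """Parameters in a token embedding table: one n_embd-sized row per token."""
    return vocab_size * n_embd


n_embd = 384
char_params = embedding_params(65, n_embd)      # character-level vocabulary
gpt2_params = embedding_params(50_000, n_embd)  # rounded GPT-2 vocabulary size

print(f"65 tokens:     {char_params:,} parameters")
print(f"50,000 tokens: {gpt2_params:,} parameters")
print(f"ratio:         {gpt2_params / char_params:.0f}x")
```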

⚙️ Training Philosophy 2 insights

Training Loop Dominates Performance

The difference between GPT-4 and GPT-5 or Gemini 3 and 3.1 stems primarily from post-training and fine-tuning strategies, not base architecture changes.

Bigram Coverage Mathematics

A 200,000-token vocabulary has 40 billion possible bigrams (200,000²), so a corpus needs at least tens of billions of tokens, and realistically far more, to cover them all, while a 65-token vocabulary has only 4,225.
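The size of the bigram space can be computed directly, as in the sketch below. It counts ordered pairs only, which gives a lower bound on corpus size: a corpus of N tokens contains at most N - 1 adjacent pairs, and because real text repeats common pairs heavily, full coverage needs far more tokens than the bound. The function name `num_bigrams` is hypothetical.

```python
def num_bigrams(vocab_size):
    """Number of distinct ordered token pairs for a given vocabulary size."""
    return vocab_size ** 2


for vocab_size in (65, 200_000):
    pairs = num_bigrams(vocab_size)
    # A corpus of N tokens has N - 1 adjacent pairs, so it needs at least
    # pairs + 1 tokens to show every bigram even once.
    print(f"vocab {vocab_size:>7,}: {pairs:,} bigrams, at least {pairs + 1:,} tokens")
```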

Bottom Line

Focus on mastering the training loop and data strategy rather than architectural complexity, as training methodology drives the majority of LLM performance improvements while allowing models to run on consumer hardware.
