Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI
TL;DR
Maxime Labonne argues that small language models (350M–24B parameters) built for edge deployment face architectural and training challenges that simply scaling down a large model does not solve. They call for specialized solutions such as short convolutions, massive over-training, and targeted reinforcement learning, which together address memory constraints and 'doom looping' while letting the models excel at agentic tool use.
💾 Architecture for Memory-Bound Devices 3 insights
Embedding layer inefficiency in small models
Mainstream small models like Gemma 3 270M and Qwen 3.5 0.8B spend 63% and 29% of their parameters, respectively, on embedding layers, because they inherit large vocabularies from the teachers they are distilled from. That leaves fewer effective parameters for reasoning.
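The embedding share can be estimated with simple arithmetic: vocabulary size times hidden dimension, divided by the total parameter count. The sketch below is illustrative only. The vocabulary size, hidden dimension, and total parameter count are assumed, approximate figures for a model of roughly 270M parameters, not numbers taken from the talk.

```python
def embedding_share(vocab_size: int, hidden_dim: int, total_params: int,
                    tied: bool = True) -> float:
    """Fraction of all parameters held by the token embedding table(s).

    With tied input/output embeddings there is one table; untied models
    carry two tables of the same shape.
    """
    tables = 1 if tied else 2
    return tables * vocab_size * hidden_dim / total_params


# Assumed, approximate configuration (hypothetical values for illustration).
share = embedding_share(vocab_size=262_144, hidden_dim=640,
                        total_params=268_000_000)
print(f"embedding share: {share:.1%}")
```

Under these assumed figures the share comes out a little under two thirds, in the same range as the 63% quoted above. Shrinking the vocabulary or the hidden dimension is the only way to move that ratio at a fixed parameter budget.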
Short convolutions maximize throughput
LFM2 utilizes gated short convolutions that profile significantly faster than sliding window attention and GQA on both AMD Ryzen CPUs and Samsung Galaxy S25 Ultra devices while using less memory.
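To make the idea concrete, here is a minimal single-channel sketch of a gated short convolution in plain Python. It is not LFM2's implementation. The function names, the kernel length of 3, and the choice to pass the gates in as precomputed lists are all assumptions made for illustration; in a real model the gates would be input-dependent projections and the operation would run over many channels at once.

```python
from typing import List


def causal_short_conv(x: List[float], kernel: List[float]) -> List[float]:
    """Causal 1-D convolution: out[t] depends only on x[t-k+1 .. t].

    kernel[0] weights the current step, kernel[1] the previous one, etc.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j
            if idx >= 0:  # positions before the sequence start count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def gated_short_conv(x: List[float], in_gate: List[float],
                     out_gate: List[float], kernel: List[float]) -> List[float]:
    """out = out_gate * conv(in_gate * x), with elementwise gates."""
    gated_in = [g * v for g, v in zip(in_gate, x)]
    mixed = causal_short_conv(gated_in, kernel)
    return [g * v for g, v in zip(out_gate, mixed)]


# Hypothetical inputs for illustration.
y = gated_short_conv(
    x=[1.0, 2.0, 3.0, 4.0],
    in_gate=[1.0, 0.0, 1.0, 1.0],
    out_gate=[1.0, 1.0, 1.0, 0.5],
    kernel=[0.5, 0.3, 0.2],
)
print(y)
```

The memory argument follows from the structure: at generation time each channel only needs the last `len(kernel) - 1` inputs, a fixed-size state, whereas an attention layer's key-value cache grows with the sequence.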
Hardware-first optimization
Effective edge architecture requires on-device profiling on target hardware rather than theoretical optimization, identifying operators that minimize latency under strict memory constraints.
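A profiling loop of this kind can be sketched as a small timing harness. This is a generic sketch, not Liquid AI's tooling: `profile_operator` and the two candidate functions are hypothetical stand-ins, and a real study would run the actual operators on the target CPU or phone and also record peak memory.

```python
import statistics
import time
from typing import Callable, Dict


def profile_operator(fn: Callable[[], object], warmup: int = 3,
                     runs: int = 20) -> Dict[str, float]:
    """Time a zero-argument callable and report latency in milliseconds."""
    for _ in range(warmup):  # warm caches before measuring
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


# Hypothetical stand-ins for two candidate operators.
def candidate_a() -> int:
    return sum(i * i for i in range(10_000))


def candidate_b() -> int:
    return sum(i * i for i in range(20_000))


results = {name: profile_operator(fn)
           for name, fn in {"a": candidate_a, "b": candidate_b}.items()}
fastest = min(results, key=lambda name: results[name]["median_ms"])
print(results, "fastest:", fastest)
```

Reporting the median rather than the mean keeps a single slow run (for example, from thermal throttling) from dominating the comparison.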
📈 Training Beyond Traditional Scaling Laws 3 insights
Massive over-training improves small models
Liquid AI trains its 350M-parameter model on 28 trillion tokens, far beyond the Chinchilla compute-optimal token budget. The approach aligns with newer test-time scaling laws and delivers significant gains across knowledge and tool-use benchmarks.
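The scale of the over-training is easier to see as a tokens-per-parameter ratio. The short calculation below uses the two figures quoted above; the value of about 20 tokens per parameter is the commonly cited Chinchilla rule of thumb and is an assumption here, not a number from the talk.

```python
# Commonly cited Chinchilla heuristic (an assumption, not from the talk).
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(train_tokens: int, params: int) -> float:
    """Training tokens seen per model parameter."""
    return train_tokens / params


ratio = tokens_per_param(train_tokens=28 * 10**12, params=350 * 10**6)
over_training_factor = ratio / CHINCHILLA_TOKENS_PER_PARAM

print(f"tokens per parameter: {ratio:,.0f}")
print(f"multiple of the Chinchilla heuristic: {over_training_factor:,.0f}x")
```

Chinchilla-style budgets minimize training compute for a given loss. For a model that will be run many times on constrained hardware, spending extra training compute to get more out of a fixed parameter count is a different trade-off.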
RL efficiency at small scale
Reinforcement learning is extremely effective for small models but requires diverse task environments and careful curation of cold-start SFT data to prevent training instability.
Narrow capability focus
Small models should specialize in specific high-value tasks like data extraction and function calling rather than competing as general-purpose chatbots or math solvers.
🔄 Solving Doom Looping 3 insights
The doom looping phenomenon
Small reasoning models frequently fall into infinite repetition loops on complex tasks: Qwen 3.5 0.8B doom-loops in over 50% of cases, compared with a near-zero rate for LFM 2.5 after optimization.
Preference alignment with diverse rollouts
Generating five temperature-sampled rollouts versus one greedy rollout during on-policy DPO data generation allows an LLM jury to identify and reject looping sequences as negative examples.
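One way to picture the pair-construction step is the sketch below, which ranks a set of rollouts with a scoring function and keeps the best as the chosen response and the worst as the rejected one. Everything named here is hypothetical: `build_pair`, `PreferencePair`, and especially `toy_jury`, a crude word-diversity score that stands in for the LLM jury described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_pair(prompt: str, rollouts: List[str],
               jury_score: Callable[[str, str], float]) -> Optional[PreferencePair]:
    """Pick the highest- and lowest-scored rollouts as a preference pair."""
    if len(rollouts) < 2:
        return None
    scored = sorted(((jury_score(prompt, r), r) for r in rollouts),
                    key=lambda item: item[0])
    (worst_score, worst), (best_score, best) = scored[0], scored[-1]
    if best_score == worst_score:  # no preference signal in this batch
        return None
    return PreferencePair(prompt=prompt, chosen=best, rejected=worst)


def toy_jury(prompt: str, response: str) -> float:
    """Stand-in for an LLM jury: fraction of distinct words (low if looping)."""
    words = response.split()
    return len(set(words)) / len(words) if words else 0.0


# Hypothetical rollouts for one prompt.
pair = build_pair(
    "What is the capital of France?",
    [
        "The capital is is is is is is is is",
        "France France capital Paris",
        "Paris is the capital of France.",
    ],
    toy_jury,
)
print(pair)
```

Sampling several rollouts matters because a single greedy decode either loops or does not. With multiple samples per prompt, looping and non-looping completions are more likely to appear side by side, which is what gives the pair its training signal.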
RL with verifiable rewards
Combining reinforcement learning with verifiable rewards and n-gram repetition penalties reduced LFM 2.5 1.2B's doom loop ratio from 16% to almost zero.
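A repetition penalty of this kind can be sketched as follows. This is an illustrative formulation, not the one used for LFM 2.5: the 4-gram window, the definition of the repeated fraction, and the linear `penalty_weight` are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def repeated_ngram_fraction(tokens: Sequence[str], n: int = 4) -> float:
    """Share of n-grams that duplicate an earlier n-gram in the sequence."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


def shaped_reward(verifiable_reward: float, tokens: Sequence[str],
                  n: int = 4, penalty_weight: float = 1.0) -> float:
    """Verifiable task reward minus a penalty for repeated n-grams."""
    return verifiable_reward - penalty_weight * repeated_ngram_fraction(tokens, n)


# Hypothetical completions for illustration.
clean: List[str] = "the answer is forty two because six times seven".split()
looping: List[str] = ("wait let me check " * 10).split()

print(shaped_reward(1.0, clean))
print(shaped_reward(1.0, looping))
```

Even when the looping completion happens to contain a correct final answer, the penalty pulls its reward well below that of a clean completion, so the policy is pushed away from repetition rather than merely toward correctness.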
🤖 Agentic Deployment Strategy 2 insights
Tools compensate for knowledge gaps
Small models overcome low parametric knowledge capacity through web search and Python tools, making them highly capable agents for latency-sensitive, offline, or privacy-critical environments like automotive and healthcare.
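On the host side, tool use usually reduces to parsing a structured call emitted by the model and routing it to a function. The sketch below assumes a JSON call format with `name` and `arguments` fields, which is a common convention but not one confirmed by the talk. `web_search` is a stub, and `calculate` is a deliberately restricted arithmetic evaluator standing in for a full Python tool.

```python
import ast
import json
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node: ast.AST) -> float:
    """Evaluate a parsed expression made only of numbers and + - * /."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def calculate(expression: str) -> str:
    return str(_eval(ast.parse(expression, mode="eval").body))


def web_search(query: str) -> str:
    """Stub: a real deployment would call a search backend here."""
    return f"[stub results for: {query}]"


TOOLS = {"calculate": calculate, "web_search": web_search}


def dispatch(raw_call: str) -> str:
    """Parse a model-emitted tool call and run it; errors become text."""
    try:
        call = json.loads(raw_call)
        return TOOLS[call["name"]](**call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError,
            ValueError, SyntaxError, ZeroDivisionError) as exc:
        return f"tool_error: {exc!r}"


print(dispatch('{"name": "calculate", "arguments": {"expression": "2 * (3 + 4)"}}'))
print(dispatch('{"name": "web_search", "arguments": {"query": "LFM2"}}'))
```

Returning errors as text rather than raising lets the model see what went wrong and retry, which matters more for small models that are likelier to emit a malformed call.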
Recursive environments bypass context limits
Long-context limitations can be solved through recursive agentic workflows and code execution rather than attempting to scale context windows architecturally.
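One possible reading of a recursive workflow is a map-and-reduce loop in which the model only ever sees inputs that fit its window. The sketch below is an interpretation, not the method shown in the talk: `recursive_reduce` is a hypothetical name, `max_chars` stands in for a token budget, and `toy_summarize` is a placeholder for a model call.

```python
from typing import Callable, List


def recursive_reduce(text: str, summarize: Callable[[str], str],
                     max_chars: int) -> str:
    """Condense text of any length using calls bounded by max_chars."""
    if len(text) <= max_chars:
        return summarize(text)
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    combined = "\n".join(summarize(chunk) for chunk in chunks)
    if len(combined) >= len(text):
        # Guard against a summarizer that does not shrink its input.
        raise ValueError("summaries are not shrinking; cannot converge")
    return recursive_reduce(combined, summarize, max_chars)


seen_lengths: List[int] = []


def toy_summarize(chunk: str) -> str:
    """Placeholder for a model call: keep the first 20 characters."""
    seen_lengths.append(len(chunk))
    return chunk[:20]


long_text = "x" * 1000
result = recursive_reduce(long_text, toy_summarize, max_chars=100)
print(len(result), max(seen_lengths))
```

The model's effective reach grows with the depth of the recursion instead of with the size of its context window, at the cost of extra calls and some loss of detail at each level.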
Bottom Line
Treat small models as specialized, agentic systems that leverage external tools and extensive pre-training rather than attempting to build miniature general-purpose chatbots.