Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI
TL;DR
Maxime Labonne explains that small language models (350M–24B parameters) built for edge deployment face architectural and training challenges distinct from simply scaling down large models. They call for specialized solutions such as short convolutions, massive over-training, and targeted reinforcement learning to overcome memory constraints and "doom looping," while excelling at agentic tool use.
💾 Architecture for Memory-Bound Devices
Embedding layer inefficiency in small models
Mainstream small models like Gemma 3 270M and Qwen 3.5 0.8B waste 63% and 29% of parameters respectively on massive embedding layers due to distillation from large-vocabulary teachers, leaving fewer effective parameters for reasoning.
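The arithmetic behind this claim is easy to reproduce. A minimal sketch, assuming Gemma 3 270M's published configuration (vocabulary of ~262k tokens, hidden size 640); treat the figures as an approximation:

```python
# Back-of-envelope: fraction of total parameters consumed by the embedding
# table alone (vocab_size x hidden_size weights).
def embedding_fraction(vocab_size: int, hidden_size: int, total_params: int) -> float:
    embedding_params = vocab_size * hidden_size
    return embedding_params / total_params

# Assumed config: vocab ~262k, hidden size 640, 270M total parameters.
frac = embedding_fraction(262_144, 640, 270_000_000)
print(f"{frac:.0%}")  # prints "62%", in line with the ~63% cited above
```

At that vocabulary size, roughly two of every three parameters sit in the lookup table rather than in transformer blocks that do the reasoning.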
Short convolutions maximize throughput
LFM2 uses gated short convolutions that, in on-device profiling, run significantly faster than sliding-window attention and GQA on both AMD Ryzen CPUs and the Samsung Galaxy S25 Ultra, while using less memory.
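To make the operator family concrete, here is a minimal NumPy sketch of a gated short convolution: a causal depthwise convolution with a tiny kernel, modulated by a sigmoid gate. This illustrates the general idea only, not LFM2's exact block; the weight shapes and gating placement are assumptions.

```python
import numpy as np

def gated_short_conv(seq, conv_w, gate_w):
    """seq: (T, D) activations; conv_w: (K, D) depthwise kernel, K small (e.g. 3);
    gate_w: (D, D) gate projection. A short kernel needs only K-1 cached rows
    of state at inference time, unlike attention's growing KV cache."""
    T, D = seq.shape
    K = conv_w.shape[0]
    padded = np.vstack([np.zeros((K - 1, D)), seq])        # causal left-padding
    conv = np.stack([(padded[t:t + K] * conv_w).sum(axis=0) for t in range(T)])
    gate = 1.0 / (1.0 + np.exp(-(seq @ gate_w)))           # sigmoid gate per channel
    return gate * conv

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
out = gated_short_conv(x, rng.standard_normal((3, 4)), rng.standard_normal((4, 4)))
```

Because the kernel is short and depthwise, the cost per token is O(K·D) with constant state, which is what makes the operator attractive on memory-bound devices.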
Hardware-first optimization
Effective edge architecture requires on-device profiling on target hardware rather than theoretical optimization, identifying operators that minimize latency under strict memory constraints.
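The profiling step itself can be as simple as timing candidate operators on the target device. A hypothetical micro-benchmark sketch (the `profile_op` helper is illustrative, not a tool named in the talk):

```python
import time

def profile_op(op, *args, warmup=3, iters=20):
    """Time a callable on the actual hardware instead of trusting FLOP counts."""
    for _ in range(warmup):      # warm caches before timing
        op(*args)
    start = time.perf_counter()
    for _ in range(iters):
        op(*args)
    return (time.perf_counter() - start) / iters  # mean seconds per call

latency = profile_op(sum, range(10_000))
```

Run on the deployment target (a phone SoC or laptop CPU), this kind of measurement surfaces memory-bandwidth effects that theoretical FLOP comparisons miss.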
📈 Training Beyond Traditional Scaling Laws
Massive over-training improves small models
Liquid AI trains its 350M parameter model on 28 trillion tokens, violating Chinchilla scaling laws but aligning with newer test-time scaling laws and delivering significant gains across knowledge and tool-use benchmarks.
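The scale of this over-training is clearer as a tokens-per-parameter ratio, using the commonly cited Chinchilla-optimal figure of roughly 20 tokens per parameter:

```python
# 350M parameters trained on 28T tokens vs. the Chinchilla-optimal budget
# of ~20 tokens per parameter.
params = 350e6
tokens = 28e12

tokens_per_param = tokens / params          # 80,000 tokens per parameter
chinchilla_budget = 20 * params             # ~7B tokens would be "optimal"
overtraining_factor = tokens / chinchilla_budget  # ~4,000x that budget
```

A 4,000x over-trained model is far past the compute-optimal point for training cost, but the extra tokens keep improving quality per deployed parameter, which is the metric that matters on-device.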
RL efficiency at small scale
Reinforcement learning is extremely effective for small models but requires diverse task environments and careful curation of cold-start SFT data to prevent training instability.
Narrow capability focus
Small models should specialize in specific high-value tasks like data extraction and function calling rather than competing as general-purpose chatbots or math solvers.
🔄 Solving Doom Looping
The doom looping phenomenon
Small reasoning models frequently enter infinite repetition loops on complex tasks, with Qwen 3.5 0.8B experiencing over 50% doom loops compared to LFM 2.5's near-zero rate after optimization.
Preference alignment with diverse rollouts
Generating five temperature-sampled rollouts versus one greedy rollout during on-policy DPO data generation allows an LLM jury to identify and reject looping sequences as negative examples.
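The pipeline above can be sketched in a few lines. This is a hedged illustration, not Liquid AI's actual code: `has_ngram_loop` stands in for the LLM jury's loop judgment, and the rollouts are toy strings rather than model samples.

```python
def has_ngram_loop(tokens, n=3, max_repeats=3):
    """Flag a sequence whose final n-gram repeats back-to-back many times."""
    if len(tokens) < n * max_repeats:
        return False
    tail = tokens[-n:]
    repeats, i = 0, len(tokens) - n
    while i >= 0 and tokens[i:i + n] == tail:
        repeats += 1
        i -= n
    return repeats >= max_repeats

def build_dpo_pair(rollouts):
    """Pair a non-looping rollout (chosen) with a looping one (rejected)."""
    clean = [r for r in rollouts if not has_ngram_loop(r)]
    loopy = [r for r in rollouts if has_ngram_loop(r)]
    return (clean[0], loopy[0]) if clean and loopy else None

# Toy rollouts: one healthy completion, one stuck repeating itself.
rollouts = [
    "the integral evaluates to pi over two".split(),
    ("the answer is " * 6).split(),
]
pair = build_dpo_pair(rollouts)
```

Sampling several rollouts at temperature > 0 is what makes this work: a single greedy rollout rarely exposes the looping failure mode, so there is nothing to reject.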
RL with verifiable rewards
Combining reinforcement learning with verifiable rewards and n-gram repetition penalties reduced LFM 2.5 1.2B's doom loop ratio from 16% to almost zero.
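One plausible form of such a shaped reward, offered as an assumed sketch rather than LFM's exact recipe: a binary verifiable reward minus a penalty proportional to the fraction of duplicated n-grams in the completion.

```python
def distinct_ngram_ratio(tokens, n=3):
    """1.0 means every n-gram is unique; values near 0 indicate heavy looping."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def shaped_reward(tokens, is_correct, penalty_weight=0.5, n=3):
    """Verifiable reward (1 if the answer checks out) minus a repetition penalty."""
    repetition = 1.0 - distinct_ngram_ratio(tokens, n)
    return (1.0 if is_correct else 0.0) - penalty_weight * repetition
```

Under this shaping, a correct but heavily looping completion earns noticeably less than a correct, non-repetitive one, so the policy gradient pushes away from loops even when the final answer verifies.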
🤖 Agentic Deployment Strategy
Tools compensate for knowledge gaps
Small models overcome low parametric knowledge capacity through web search and Python tools, making them highly capable agents for latency-sensitive, offline, or privacy-critical environments like automotive and healthcare.
Recursive environments bypass context limits
Long-context limitations can be solved through recursive agentic workflows and code execution rather than attempting to scale context windows architecturally.
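The recursive pattern can be sketched as a map-reduce over chunks: process pieces that fit the window, concatenate the outputs, and recurse until the result fits. Here `summarize` is a stand-in for a model call (it just truncates), so this shows only the control flow, not real summarization.

```python
def summarize(text, limit=64):
    """Placeholder for a model call; a real agent would call the LLM here."""
    return text[:limit]

def recursive_reduce(text, chunk_size=256, limit=64):
    """Split, summarize each chunk, then recurse on the merged summaries."""
    if len(text) <= chunk_size:
        return summarize(text, limit)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    merged = " ".join(summarize(c, limit) for c in chunks)
    return recursive_reduce(merged, chunk_size, limit)

result = recursive_reduce("lorem ipsum " * 500)
```

Each level of recursion shrinks the input by roughly `chunk_size / limit`, so arbitrarily long inputs terminate in a logarithmic number of passes without ever needing a longer context window.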
Bottom Line
Treat small models as specialized, agentic systems that leverage external tools and extensive pre-training rather than attempting to build miniature general-purpose chatbots.