TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google
TL;DR
Cormac Brick from Google AI Edge introduces Tiny LLMs (TLMs) and on-device agent capabilities powered by LiteRT-LM and the new Gemma 4 models, demonstrating how fine-tuned small models (100M-4B parameters) can now deliver sophisticated AI experiences—including multimodal reasoning and tool use—directly on mobile phones, laptops, and even Raspberry Pis without cloud dependency.
📱 Edge AI Deployment Patterns 3 insights
System-level versus In-app GenAI architectures
System-level GenAI integrates 2-5B-parameter foundation models into the mobile OS (for example Android's AICore service) for general-purpose tasks, while in-app GenAI ships Tiny LLMs (100M-500M parameters) fine-tuned for specific functions so a feature works on every device, including non-premium hardware.
Privacy and latency advantages drive edge adoption
On-device processing keeps sensitive data such as messages local and encrypted while eliminating network latency, enabling real-time applications like live voice translation that a cloud round-trip would make impractical.
Cross-platform deployment via LiteRT-LM
LiteRT-LM enables single-file deployment across Android, iOS, macOS, Linux, Windows, and the web on CPU and GPU, with specialized NPU compilation available for dedicated accelerators such as those on Qualcomm robotics platforms.
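As a concrete sketch of the in-app path, here is a minimal Kotlin example using the MediaPipe LLM Inference API, one published way to run LiteRT-backed models on Android; the model bundle path is a placeholder, and LiteRT-LM's own C++ surface differs.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Runs one prompt fully on-device; no network access is required.
fun runLocalPrompt(context: Context): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/gemma_e2b.task") // placeholder bundle path
        .setMaxTokens(512)
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    val reply = llm.generateResponse("Summarize: the meeting moved to 3pm Friday.")
    llm.close()
    return reply
}
```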
🧠 Gemma 4 Model Capabilities 3 insights
Memory-efficient E2B and E4B variants
Gemma 4 E2B and E4B keep only about 2B and 4B parameters in RAM during inference, respectively, by memory-mapping per-layer embeddings from storage instead of loading the full embedding tables, which lets them run on premium mobile devices with limited RAM.
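The memory-mapping trick itself is standard OS machinery. A toy Kotlin sketch of the concept, assuming a hypothetical flat float32 embedding file (this is an illustration, not Gemma's actual loader):

```kotlin
import java.io.RandomAccessFile
import java.nio.ByteOrder
import java.nio.channels.FileChannel

// Toy illustration of memory-mapped weights: the OS pages in only the
// bytes that are touched, so a multi-gigabyte embedding file never has
// to be fully resident in RAM. Path and layout are hypothetical.
fun readEmbeddingRow(path: String, row: Int, dim: Int): FloatArray {
    RandomAccessFile(path, "r").use { file ->
        val bytesPerRow = dim * 4L // float32 rows
        val buf = file.channel
            .map(FileChannel.MapMode.READ_ONLY, row * bytesPerRow, bytesPerRow)
            .order(ByteOrder.LITTLE_ENDIAN)
        val out = FloatArray(dim)
        buf.asFloatBuffer().get(out)
        return out
    }
}
```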
Native agent capabilities built-in
Unlike previous generations, Gemma 4 builds in native function calling and reasoning modes, allowing the model to use tools and chain reasoning steps autonomously without an external orchestration layer.
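On the app side, function calling reduces to a parse-and-dispatch step. A minimal Kotlin sketch, assuming a JSON call format and a hypothetical set_alarm tool (Gemma's actual output schema may differ):

```kotlin
import org.json.JSONObject

// Hypothetical app function the model is allowed to call.
fun setAlarm(hour: Int, minute: Int) { /* schedule via AlarmManager */ }

// Assumes the model was instructed to emit calls as
// {"name": "set_alarm", "args": {"hour": 7, "minute": 30}}.
fun dispatchToolCall(modelOutput: String) {
    val call = JSONObject(modelOutput)
    when (val name = call.getString("name")) {
        "set_alarm" -> {
            val args = call.getJSONObject("args")
            setAlarm(args.getInt("hour"), args.getInt("minute"))
        }
        else -> error("Unknown tool: $name")
    }
}
```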
Multimodal inputs and open licensing
E2B and E4B accept audio, image, and text inputs, and all Gemma 4 models are released under the Apache 2.0 license, permitting unrestricted commercial use and broader accessibility.
⚡ Performance and Hardware Benchmarks 3 insights
High throughput on mobile and desktop
Gemma 4 E2B achieves thousands of tokens per second on high-end Android GPUs and Apple Silicon MacBooks, enabling responsive conversational interfaces and real-time agent interactions.
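Numbers like these are easy to sanity-check on your own hardware. A rough Kotlin sketch reusing the LlmInference handle from the earlier example; note it measures one end-to-end call, so prefill and decode are lumped together:

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Rough end-to-end tokens/sec; understates pure decode speed on long prompts.
fun measureTokensPerSecond(llm: LlmInference, prompt: String): Double {
    val start = System.nanoTime()
    val output = llm.generateResponse(prompt)
    val seconds = (System.nanoTime() - start) / 1e9
    return llm.sizeInTokens(output) / seconds
}
```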
Viable edge and IoT deployment
The model runs at 133 tokens per second on a Raspberry Pi 5 for image analysis and performs strongly on Qualcomm IoT platforms with NPU acceleration, proving viability beyond smartphones.
Sub-billion parameter reliability
Google's 270M-parameter Function Gemma model achieved 85-90% reliability on voice-to-function-calling tasks across 10 Android functions, demonstrating that tiny models can handle specific production workflows when fine-tuned.
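A reliability figure like that falls out of a small eval harness. A hedged Kotlin sketch, with a hypothetical test-case shape and exact-match scoring (real evals usually normalize call arguments before comparing):

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Hypothetical harness: each case pairs a voice transcript with the
// exact tool call the model is expected to emit.
data class Case(val utterance: String, val expectedCall: String)

fun functionCallAccuracy(llm: LlmInference, cases: List<Case>): Double {
    val correct = cases.count { case ->
        llm.generateResponse(case.utterance).trim() == case.expectedCall
    }
    return correct.toDouble() / cases.size
}
```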
🤖 On-Device Agent Skills 3 insights
Autonomous skill execution framework
Gemma 4's combination of reasoning and tool use enables agent skills where models autonomously select and execute functions based on natural language descriptions without cloud dependency.
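Mechanically, an agent skill reduces to a short loop: generate, detect a tool call, execute it, append the result, repeat. A sketch under the same hypothetical JSON-call convention as above (executeTool stands in for real dispatch):

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import org.json.JSONObject

// Stand-in for app-specific dispatch, e.g. the set_alarm example above.
fun executeTool(call: JSONObject): String = "ok"

// Minimal agent loop: run any tool call the model emits, feed the
// result back, and stop once the model answers in plain text.
fun runAgent(llm: LlmInference, task: String, maxSteps: Int = 5): String {
    var transcript = task
    repeat(maxSteps) {
        val output = llm.generateResponse(transcript).trim()
        if (!output.startsWith("{")) return output // plain answer: done
        val result = executeTool(JSONObject(output))
        transcript += "\n$output\nTOOL_RESULT: $result"
    }
    return "No final answer after $maxSteps steps"
}
```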
Google AI Edge Gallery demonstration app
The cross-platform demo app showcases practical implementations including voice-controlled function calling, multimodal chat, and real-time translation running entirely locally on both Android and iOS devices.
Fine-tuning requirements for tiny models
Models under 500M parameters typically require task-specific fine-tuning for production reliability, while larger edge models can operate as general-purpose foundation models via prompting alone.
Bottom Line
Developers should evaluate task-specific fine-tuning of Tiny LLMs (100M-500M parameters) using LiteRT-LM for in-app GenAI to achieve privacy-preserving, offline-capable agent experiences, while leveraging system-level Gemma 4 models (2B-4B) for general reasoning on premium devices.