TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google
TL;DR
Cormac Brick from Google AI Edge introduces Tiny LLMs (TLMs) and on-device agent capabilities powered by LiteRT-LM and the new Gemma 4 models. He demonstrates how fine-tuned small models (100M-4B parameters) can now deliver sophisticated AI experiences, including multimodal reasoning and tool use, directly on mobile phones, laptops, and even Raspberry Pis without depending on the cloud.
📱 Edge AI Deployment Patterns
System-level versus In-app GenAI architectures
System-level GenAI uses 2-5B parameter foundation models integrated into the mobile operating system, such as Android's AICore, for general tasks. In-app GenAI instead ships Tiny LLMs (100-500M parameters) fine-tuned for specific functions, so that a feature works on all devices, including non-premium hardware.
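To make the size gap between the two patterns concrete, here is a rough back-of-the-envelope sketch of weight storage. The parameter counts come from the ranges above; the 4-bit and 8-bit quantization widths are illustrative assumptions, not figures from the talk, and the estimate ignores activations, KV cache, and runtime overhead.

```python
def weight_footprint_mb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Parameter counts from the ranges above; bit widths are assumptions.
MODEL_SIZES = [("100M", 100e6), ("500M", 500e6), ("2B", 2e9), ("5B", 5e9)]

for label, params in MODEL_SIZES:
    for bits in (4, 8):
        size = weight_footprint_mb(params, bits)
        print(f"{label} parameters at {bits}-bit: {size:,.0f} MB")
```

Under these assumptions a 100M-parameter model needs tens of megabytes for its weights while a 5B-parameter model needs gigabytes, which is one way to see why the smaller models fit inside an individual app.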
Privacy and latency advantages drive edge adoption
On-device processing keeps sensitive data, such as encrypted messages, on the device and removes network round trips, enabling real-time applications such as live voice translation that are hard to deliver through cloud services.
Cross-platform deployment via LiteRT-LM
LiteRT-LM enables single-file deployment across Android, iOS, macOS, Linux, Windows, and the web on CPU and GPU, with specialized NPU compilation available for dedicated hardware accelerators such as Qualcomm robotics platforms.
🧠 Gemma 4 Model Capabilities
Memory-efficient E2B and E4B variants
Gemma 4 E2B and E4B keep only 2B and 4B parameters, respectively, resident in RAM during inference. They memory-map the per-layer embeddings rather than loading the full tables, which lets them run on premium mobile devices with limited RAM.
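The memory-mapping idea can be illustrated with a minimal sketch using Python's standard mmap module. The file layout, table sizes, and lookup helper are hypothetical and do not reflect how LiteRT-LM or Gemma 4 actually store embeddings. The point is only that a mapped file lets a process read one row without loading the whole table.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical sizes, kept small so the demo runs quickly.
NUM_LAYERS, VOCAB_SIZE, EMBED_DIM = 4, 1000, 16
ROW_BYTES = EMBED_DIM * 4  # one float32 row

path = os.path.join(tempfile.mkdtemp(), "per_layer_embeddings.bin")

# Write a stand-in table: every value in a row equals layer * VOCAB_SIZE + token_id.
with open(path, "wb") as f:
    for layer in range(NUM_LAYERS):
        for token_id in range(VOCAB_SIZE):
            value = float(layer * VOCAB_SIZE + token_id)
            f.write(struct.pack(f"<{EMBED_DIM}f", *([value] * EMBED_DIM)))


def lookup(mapped: mmap.mmap, layer: int, token_id: int) -> tuple:
    """Decode a single embedding row straight from the mapped file."""
    offset = (layer * VOCAB_SIZE + token_id) * ROW_BYTES
    return struct.unpack_from(f"<{EMBED_DIM}f", mapped, offset)


with open(path, "rb") as f:
    # Mapping does not read the file up front; the OS pages data in on access.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    row = lookup(mapped, layer=2, token_id=42)
    mapped.close()

print(len(row), row[0])
```

Only the pages that hold the requested row need to be brought into memory, so resident memory tracks the tokens actually used rather than the full table size.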
Native agent capabilities built-in
Unlike previous generations, Gemma 4 integrates native function calling and reasoning modes, allowing the model to autonomously use tools and chain thoughts without external orchestration layers.
Multimodal inputs and open licensing
E2B and E4B support audio, image, and text inputs, and all Gemma 4 models are released under the Apache 2.0 license, which permits commercial use and broadens accessibility.
⚡ Performance and Hardware Benchmarks
High-throughput on mobile and desktop
Gemma 4 E2B achieves thousands of tokens per second on high-end Android GPUs and Apple Silicon MacBooks, enabling responsive conversational interfaces and real-time agent interactions.
Viable edge and IoT deployment
The model runs at 133 tokens per second on Raspberry Pi 5 for image analysis and demonstrates strong performance on Qualcomm IoT platforms utilizing NPU acceleration, proving viability beyond smartphones.
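A quick calculation shows what such a throughput figure means for response time. This sketch assumes the quoted 133 tokens per second applies to the tokens being produced or processed at a steady rate; the summary does not say whether it is a prefill or decode figure, and the token counts below are hypothetical.

```python
def processing_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to handle num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


PI5_TOKENS_PER_SECOND = 133  # figure quoted above

for num_tokens in (32, 128, 512):  # hypothetical workload sizes
    seconds = processing_seconds(num_tokens, PI5_TOKENS_PER_SECOND)
    print(f"{num_tokens} tokens: {seconds:.2f} s")
```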
Sub-billion parameter reliability
Google's 270M-parameter FunctionGemma model achieved 85-90% reliability on voice-to-function calling tasks across 10 Android functions, demonstrating that tiny models can handle specific production workflows when fine-tuned.
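One plausible way to measure a reliability figure like this is exact-match scoring of predicted function calls against expected ones. The sketch below is an assumption about the metric, not the evaluation Google used, and the function names and test cases are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionCall:
    name: str
    args: tuple  # sorted (key, value) pairs, so argument order does not matter


def call(name: str, **kwargs) -> FunctionCall:
    return FunctionCall(name, tuple(sorted(kwargs.items())))


def reliability(predicted: list, expected: list) -> float:
    """Fraction of cases where name and arguments both match exactly."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


# Hypothetical test cases; these are not real Android function names.
expected = [
    call("set_alarm", hour=7, minute=30),
    call("toggle_flashlight", on=True),
    call("send_message", to="Sam", body="Running late"),
    call("set_volume", level=40),
]
predicted = [
    call("set_alarm", minute=30, hour=7),  # same call, different argument order
    call("toggle_flashlight", on=True),
    call("send_message", to="Sam", body="Running late"),
    call("set_volume", level=4),  # wrong argument value counts as a miss
]

score = reliability(predicted, expected)
print(f"{score:.0%}")
```

Exact match is a strict choice: a call with the right function but one wrong argument scores zero, which suits cases where a near miss would still trigger the wrong action on the device.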
🤖 On-Device Agent Skills
Autonomous skill execution framework
Gemma 4's combination of reasoning and tool use enables agent skills where models autonomously select and execute functions based on natural language descriptions without cloud dependency.
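The select-and-execute loop described here can be sketched as a small tool registry plus a dispatcher. Everything in this example is hypothetical: the tool names, the decorator, and especially the JSON shape of the model output, since the summary does not show Gemma 4's actual function-calling format or the LiteRT-LM API.

```python
import json
from typing import Callable

TOOLS: dict = {}


def tool(description: str):
    """Register a function together with the description the model would see."""
    def register(fn: Callable) -> Callable:
        TOOLS[fn.__name__] = {"fn": fn, "description": description}
        return fn
    return register


@tool("Turn the device flashlight on or off.")
def set_flashlight(on: bool) -> str:
    return f"flashlight {'on' if on else 'off'}"


@tool("Create a calendar event with a title and a start hour (0-23).")
def create_event(title: str, hour: int) -> str:
    return f"event '{title}' at {hour:02d}:00"


def tool_prompt() -> str:
    """Natural language tool list to place in the model's context."""
    return "\n".join(f"- {name}: {meta['description']}" for name, meta in TOOLS.items())


def dispatch(model_output: str) -> str:
    """Parse a JSON tool call and run it, rejecting unregistered tools."""
    request = json.loads(model_output)
    name = request.get("name")
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name!r}")
    return TOOLS[name]["fn"](**request.get("arguments", {}))


# Stand-in for text a model might emit; the JSON shape is an assumption.
model_output = '{"name": "create_event", "arguments": {"title": "Standup", "hour": 9}}'
print(dispatch(model_output))
```

Checking the requested name against the registry before calling anything keeps the model from invoking functions the app never exposed.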
Google AI Edge Gallery demonstration app
The cross-platform demo app showcases practical implementations including voice-controlled function calling, multimodal chat, and real-time translation running entirely locally on both Android and iOS devices.
Fine-tuning requirements for tiny models
Models under 500M parameters typically require task-specific fine-tuning for production reliability, while larger edge models can operate as general-purpose foundation models via prompting alone.
Bottom Line
Developers should consider fine-tuning Tiny LLMs (100M-500M parameters) for specific tasks and deploying them with LiteRT-LM for in-app GenAI to get privacy-preserving, offline-capable agent experiences, while relying on system-level Gemma 4 models (2B-4B) for general reasoning on premium devices.