TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google

| Podcasts | May 03, 2026 | 13.2K views | 1:20:58

TL;DR

Cormac Brick from Google AI Edge introduces Tiny LLMs (TLMs) and on-device agent capabilities powered by LiteRT-LM and the new Gemma 4 models, demonstrating how fine-tuned small models (100M-4B parameters) can now deliver sophisticated AI experiences—including multimodal reasoning and tool use—directly on mobile phones, laptops, and even Raspberry Pis without cloud dependency.

📱 Edge AI Deployment Patterns (3 insights)

System-level versus In-app GenAI architectures

System-level GenAI embeds a 2-5B parameter foundation model in the mobile OS (e.g. Android AICore) for general tasks, while in-app GenAI ships Tiny LLMs (100-500M parameters) fine-tuned for specific functions, so the feature works on every device, including non-premium hardware.
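
That split can be sketched as a small decision helper. This is purely illustrative (the function name and the 6 GB RAM threshold are placeholders I chose, not anything from Google):

```python
# Hypothetical decision helper encoding the parameter-count guidance
# above; names and the RAM cutoff are illustrative assumptions.
def choose_pattern(task_specific: bool, device_ram_gb: float) -> str:
    """Return a deployment pattern for a GenAI feature."""
    if task_specific or device_ram_gb < 6:
        # Tiny LLMs ship inside the app binary and run even on
        # non-premium hardware when fine-tuned for one narrow job.
        return "in-app: 100-500M fine-tuned TLM"
    # General-purpose tasks on capable devices can call the shared
    # OS-level foundation model instead of bundling their own.
    return "system-level: 2-5B foundation model"

print(choose_pattern(task_specific=True, device_ram_gb=8))
```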

Privacy and latency advantages drive edge adoption

On-device processing keeps sensitive data such as messages encrypted locally and eliminates network latency, enabling real-time applications like live voice translation that are impractical over a cloud round trip.

Cross-platform deployment via LiteRT-LM

LiteRT-LM enables single-file deployment across Android, iOS, macOS, Linux, Windows, and web using CPU/GPU, with specialized NPU compilation available for dedicated hardware accelerators like Qualcomm robotics platforms.
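
The dispatch described above can be sketched as follows; the function and return strings are illustrative stand-ins, not LiteRT-LM's actual API:

```python
# Illustrative backend selection mirroring the platform list above.
SUPPORTED = {"android", "ios", "macos", "linux", "windows", "web"}

def pick_backend(platform: str, has_npu: bool = False) -> str:
    if has_npu:
        # Dedicated accelerators (e.g. Qualcomm NPUs) require a
        # specialized ahead-of-time compilation step.
        return "NPU (ahead-of-time compiled)"
    if platform in SUPPORTED:
        # The same model file runs on CPU/GPU across these platforms.
        return "GPU with CPU fallback"
    raise ValueError(f"unsupported platform: {platform}")

print(pick_backend("android"))  # GPU with CPU fallback
```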

🧠 Gemma 4 Model Capabilities (3 insights)

Memory-efficient E2B and E4B variants

Gemma 4 E2B and E4B require only 2B and 4B parameters in RAM respectively during inference by memory-mapping per-layer embeddings rather than loading full tables, enabling operation on premium mobile devices with limited RAM.
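
A toy illustration of the memory-mapping idea (this is not Gemma's actual loader, just NumPy's `mmap_mode` showing how only touched rows get paged in):

```python
import os
import tempfile

import numpy as np

# Toy per-layer embedding table: written to disk, then memory-mapped
# so the OS pages in only the rows actually indexed, rather than
# resident-loading the whole table into RAM.
path = os.path.join(tempfile.mkdtemp(), "layer_emb.npy")
table = np.random.rand(10_000, 64).astype(np.float32)  # ~2.6 MB
np.save(path, table)

mapped = np.load(path, mmap_mode="r")  # no bulk read into memory
token_ids = [3, 42, 17]
rows = mapped[token_ids]               # only these rows are paged in
print(rows.shape)                      # (3, 64)
```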

Native agent capabilities built-in

Unlike previous generations, Gemma 4 integrates native function calling and reasoning modes, allowing the model to autonomously use tools and chain thoughts without external orchestration layers.
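
A minimal sketch of such a tool-use loop, with the model stubbed out; the JSON shape and tool names are my assumptions, not Gemma's actual wire format:

```python
import json

# Hypothetical tool registry the runtime exposes to the model.
TOOLS = {"set_alarm": lambda hour: f"alarm set for {hour}:00"}

def fake_model(prompt: str) -> str:
    # Stand-in for an on-device model that decides to call a tool
    # and emits the call as JSON.
    return json.dumps({"tool": "set_alarm", "args": {"hour": 7}})

def run_agent(prompt: str) -> str:
    # Parse the model's emitted call and dispatch it locally --
    # no external orchestration layer involved.
    call = json.loads(fake_model(prompt))
    return TOOLS[call["tool"]](**call["args"])

print(run_agent("wake me at 7"))  # alarm set for 7:00
```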

Multimodal inputs and open licensing

E2B and E4B accept audio, image, and text inputs, and all Gemma 4 models are released under the Apache 2.0 license for unrestricted commercial use and broader accessibility.

⚡ Performance and Hardware Benchmarks (3 insights)

High-throughput on mobile and desktop

Gemma 4 E2B achieves thousands of tokens per second on high-end Android GPUs and Apple Silicon MacBooks, enabling responsive conversational interfaces and real-time agent interactions.

Viable edge and IoT deployment

The model runs at 133 tokens per second on Raspberry Pi 5 for image analysis and demonstrates strong performance on Qualcomm IoT platforms utilizing NPU acceleration, proving viability beyond smartphones.
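
Quick arithmetic on that Raspberry Pi 5 figure: at 133 tokens/s, a 400-token response (my example length) decodes in roughly three seconds:

```python
# Decode-time estimate from the quoted throughput.
tokens, rate = 400, 133.0
seconds = tokens / rate
print(round(seconds, 2))  # ~3.01
```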

Sub-billion parameter reliability

Google's 270M parameter Function Gemma model achieved 85-90% reliability on voice-to-function calling tasks across 10 Android functions, demonstrating that tiny models can handle specific production workflows when fine-tuned.
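
A toy accuracy check in the spirit of that benchmark; the function names and labels below are made up, not Google's evaluation set:

```python
# Hypothetical mini-eval: score predicted (function, args) calls
# against gold labels, as one might for a voice-to-function model.
gold = [("set_timer", {"minutes": 5}), ("open_app", {"name": "maps"}),
        ("set_alarm", {"hour": 7}), ("send_text", {"to": "Sam"})]
pred = [("set_timer", {"minutes": 5}), ("open_app", {"name": "camera"}),
        ("set_alarm", {"hour": 7}), ("send_text", {"to": "Sam"})]

# Exact-match accuracy over the call set.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 0.75
```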

🤖 On-Device Agent Skills (3 insights)

Autonomous skill execution framework

Gemma 4's combination of reasoning and tool use enables agent skills where models autonomously select and execute functions based on natural language descriptions without cloud dependency.

Google AI Edge Gallery demonstration app

The cross-platform demo app showcases practical implementations including voice-controlled function calling, multimodal chat, and real-time translation running entirely locally on both Android and iOS devices.

Fine-tuning requirements for tiny models

Models under 500M parameters typically require task-specific fine-tuning for production reliability, while larger edge models can operate as general-purpose foundation models via prompting alone.

Bottom Line

Developers should evaluate task-specific fine-tuning of Tiny LLMs (100M-500M parameters) using LiteRT-LM for in-app GenAI to achieve privacy-preserving, offline-capable agent experiences, while leveraging system-level Gemma 4 models (2B-4B) for general reasoning on premium devices.
