Bypassing the Multimodal Tax: Hybrid RAG, SQL RRF & UI Telemetry - Abed Matini, Ogilvy

AI Engineer

Podcasts | June 28, 2026 | 368 views | 45:48

TL;DR

Abed Matini presents a framework-free Hybrid RAG architecture that eliminates pre-query token costs by preprocessing documents locally using DocLink and multiple chunking strategies, while implementing SQL-based Reciprocal Rank Fusion and LangFuse telemetry for production observability.

💰 Eliminating the Multimodal Tax

Pre-upload token waste

Uploading documents to cloud LLMs consumes tokens at ingestion, before users have asked a single question. The cost is avoidable, and it grows with large documents such as employee handbooks and with frequent uploads.

Production tool complexity

Traditional RAG implementations require managing separate vector databases alongside keyword and semantic search tools, increasing operational overhead and debugging difficulty.

🏗️ Local-First Architecture

DocLink markdown conversion

Documents are converted to markdown locally using Python and DocLink, enabling CPU-only processing via Ollama without GPU requirements or cloud API costs.

SQL-based hybrid search

Implements Reciprocal Rank Fusion (RRF) directly in PostgreSQL to combine keyword and semantic search, eliminating the need for separate vector database infrastructure.
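The fusion step itself is small. The talk runs it inside PostgreSQL; the Python sketch below only illustrates the scoring rule, where each document earns 1 / (k + rank) from every result list it appears in. The function name, the sample document IDs, and k = 60 (a commonly used default) are assumptions for illustration, not details from the talk.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one scored ranking.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (assumed value; 60 is a common default).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a keyword search and a semantic search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear in both lists (doc_a and doc_b here) rise above documents found by only one search, which is the behavior that makes rank fusion useful for hybrid retrieval. In SQL the same idea would typically be expressed by ranking each result set and summing the reciprocal ranks per document.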

Multi-format ingestion

Supports PDFs, Word docs, PowerPoints, and images via OCR conversion, allowing screenshots of emails or maintenance notices to be quickly indexed as temporary knowledge.

🧩 Strategic Document Chunking

Heading-based extraction

Chunks documents by headers to create clean Q&A pairs, ideal for FAQs and handbooks with precise traceability to source references.
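A minimal sketch of heading-based chunking over converted markdown, assuming ATX-style headings (lines starting with one to six `#` characters). The function name, the output shape, and the sample text are hypothetical; the talk does not show its implementation.

```python
import re

# Matches markdown headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown):
    """Split markdown into one chunk per heading, keeping the heading as a source reference."""
    chunks = []
    title, body = None, []

    def flush():
        # Emit a chunk only if there is a heading or some non-blank text.
        if title is not None or any(line.strip() for line in body):
            chunks.append({"heading": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


# Made-up handbook excerpt, for illustration only.
sample = "# Leave policy\nRequests go through the HR portal.\n\n## Sick leave\nNotify your manager.\n"
for chunk in chunk_by_headings(sample):
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk pairs a heading with the text beneath it, which maps naturally onto a question-and-answer entry and keeps the heading available for tracing an answer back to its source section. This sketch does not handle `#` lines inside fenced code blocks.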

Fixed-length fallback

Uses 512-character chunks with 64-character overlap for unstructured data when semantic boundaries like paragraphs or headings are unavailable.
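The fixed-length fallback can be sketched as a sliding window, using the 512-character size and 64-character overlap quoted above. The function name and the stopping rule are assumptions.

```python
def fixed_length_chunks(text, size=512, overlap=64):
    """Split text into windows of `size` characters, each sharing `overlap` characters with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window reaches the end, to avoid a trailing chunk
        # that only repeats the overlap region.
        if start + size >= len(text):
            break
    return chunks


# 1000 characters of filler text for demonstration.
demo_text = "".join(str(i % 10) for i in range(1000))
demo_chunks = fixed_length_chunks(demo_text)
print([len(c) for c in demo_chunks])
```

With the default settings the window advances 448 characters at a time, so the last 64 characters of one chunk reappear at the start of the next. That overlap reduces the chance that a sentence cut at a chunk boundary is lost to retrieval.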

Sentence grouping for images

Converts image screenshots to text using sentence-based chunking for rapid deployment of temporary announcements like maintenance windows.
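Sentence grouping can be sketched as follows, assuming the OCR step has already produced plain text. The regular-expression sentence splitter, the group size, and the sample notice are illustrative assumptions; a production system might use a proper sentence tokenizer instead.

```python
import re


def sentence_chunks(text, sentences_per_chunk=3):
    """Group consecutive sentences into chunks of at most `sentences_per_chunk` sentences."""
    if sentences_per_chunk < 1:
        raise ValueError("sentences_per_chunk must be at least 1")
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [part.strip() for part in parts if part.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


# Made-up maintenance notice, standing in for OCR output from a screenshot.
notice = (
    "Maintenance starts Friday. The portal will be offline. "
    "Service resumes Monday. Contact IT with questions."
)
for chunk in sentence_chunks(notice, sentences_per_chunk=2):
    print(chunk)
```

Short notices rarely have headings, so grouping a few sentences at a time keeps each chunk self-contained without any structural markup.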

📊 Observability and Cost Control

LangFuse telemetry

Tracks chat latency, retrieval paths, and token usage while monitoring for prompt injection attacks and risky query patterns.
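To make the tracked fields concrete, here is a library-agnostic sketch of a per-request trace record. It does not use the LangFuse SDK; the class, the field names, the phrase list, and the `answer_fn` callback are all hypothetical stand-ins, and real prompt-injection detection would need more than substring matching.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ChatTrace:
    """One record per chat request (hypothetical schema)."""
    query: str
    retrieval_path: str
    prompt_tokens: int = 0
    completion_tokens: int = 0
    latency_ms: float = 0.0
    flags: list = field(default_factory=list)


# Illustrative phrases only; not a complete or reliable injection filter.
SUSPICIOUS_PHRASES = ("ignore previous instructions", "reveal your system prompt")


def traced_answer(query, answer_fn, retrieval_path="hybrid"):
    """Call answer_fn(query) and record latency, token usage, and risk flags.

    answer_fn is assumed to return (answer, prompt_tokens, completion_tokens).
    """
    trace = ChatTrace(query=query, retrieval_path=retrieval_path)
    lowered = query.lower()
    trace.flags = [phrase for phrase in SUSPICIOUS_PHRASES if phrase in lowered]

    start = time.perf_counter()
    answer, prompt_tokens, completion_tokens = answer_fn(query)
    trace.latency_ms = (time.perf_counter() - start) * 1000

    trace.prompt_tokens = prompt_tokens
    trace.completion_tokens = completion_tokens
    return answer, trace


def fake_answer(query):
    # Stand-in for the real retrieval and generation pipeline.
    return "ok", 12, 3


answer, trace = traced_answer("Please ignore previous instructions", fake_answer)
print(trace)
```

In a real deployment a record like this would be sent to the telemetry backend rather than printed.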

Pre-deployment chunk preview

Admin dashboard displays exactly how documents are segmented before going live, enabling debugging of retrieval quality and chunk relevance issues.

Bottom Line

Preprocess documents locally using heading-aware chunking and PostgreSQL Reciprocal Rank Fusion to eliminate cloud token costs while maintaining full observability through LangFuse telemetry.
