Coding Challenge 188: Voice Chatbot
TL;DR
Daniel Shiffman builds a fully local voice chatbot in p5.js, using Whisper for speech-to-text and Kokoro TTS for text-to-speech. He demonstrates how to process audio entirely in the browser while advocating creative, lightweight alternatives to large language models for the bot's 'brain'.
🎯 Local AI Architecture
Browser-based speech processing with Whisper
Implements OpenAI's Whisper model via Transformers.js to convert speech to text locally using WebGPU acceleration, ensuring no audio data leaves the computer.
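A minimal sketch of this step using the Transformers.js `pipeline` API; the specific model id (`onnx-community/whisper-tiny.en`) is an assumption, not necessarily the one used in the video.

```javascript
// Load Whisper in the browser via Transformers.js.
import { pipeline } from '@huggingface/transformers';

// device: 'webgpu' runs inference on the GPU, entirely client-side.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'onnx-community/whisper-tiny.en', // assumed model id
  { device: 'webgpu' }
);

// `waveform` is a Float32Array of mono audio sampled at 16 kHz.
async function transcribe(waveform) {
  const result = await transcriber(waveform);
  return result.text; // the recognized speech as a string
}
```

The model weights are fetched once from the Hugging Face Hub and cached; after that, transcription requires no network access.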
Open-source text-to-speech with Kokoro
Integrates the Kokoro TTS model to generate natural speech from text responses, loading the model directly from Hugging Face into the browser.
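A sketch of loading Kokoro in the browser with the `kokoro-js` package; the model id, quantization option, and voice name (`af_heart`) follow the library's published examples and are assumptions about this particular sketch.

```javascript
import { KokoroTTS } from 'kokoro-js';

// Quantized weights (q8) keep the in-browser download small.
const tts = await KokoroTTS.from_pretrained(
  'onnx-community/Kokoro-82M-v1.0-ONNX',
  { dtype: 'q8' }
);

async function speak(text) {
  // Generate a waveform for the bot's reply with a chosen voice.
  const audio = await tts.generate(text, { voice: 'af_heart' });
  return audio; // can be converted to a Blob and played in the page
}
```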
Zero-cloud audio privacy
All audio processing happens client-side; microphone input is never transmitted to external servers, even though the models themselves are downloaded from the cloud.
💻 Technical Implementation
Native Web Audio API workflow
Uses navigator.mediaDevices and MediaRecorder to capture audio chunks into a Blob for processing, avoiding the p5.sound library to demonstrate the underlying web audio mechanics.
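The capture flow can be sketched with plain Web APIs like this; `handleAudio` is a hypothetical downstream handler, not a name from the video.

```javascript
let mediaRecorder;
let chunks = [];

async function initMic() {
  // Ask the browser for microphone access.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  mediaRecorder = new MediaRecorder(stream);

  // Each piece of encoded audio arrives as a Blob chunk.
  mediaRecorder.ondataavailable = (e) => chunks.push(e.data);

  // On stop, combine the chunks into a single Blob for decoding.
  mediaRecorder.onstop = () => {
    const blob = new Blob(chunks, { type: mediaRecorder.mimeType });
    chunks = [];
    handleAudio(blob); // hypothetical handler that decodes the audio
  };
}
```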
16kHz waveform preparation
Decodes audio data through an AudioContext at a 16,000 Hz sample rate and extracts single-channel waveform data to match Whisper's specific input requirements.
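A sketch of that decoding step, assuming the recording has already been collected into a Blob:

```javascript
// Convert a recorded Blob into the mono 16 kHz Float32Array Whisper expects.
async function blobToWaveform(blob) {
  // An AudioContext created at 16,000 Hz resamples during decode.
  const audioContext = new AudioContext({ sampleRate: 16000 });
  const arrayBuffer = await blob.arrayBuffer();
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);

  // Take channel 0 only — Whisper wants single-channel audio.
  return audioBuffer.getChannelData(0); // Float32Array
}
```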
Async/await pipeline setup
Leverages p5.js 2.0 features to asynchronously import Transformers.js and initialize machine learning pipelines within the sketch setup function.
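In p5.js 2.0, `setup()` can be declared `async`, so model loading can be awaited before the sketch runs. A sketch of the pattern (the CDN URL and model id are assumptions):

```javascript
let transcriber;

async function setup() {
  createCanvas(400, 400);

  // Dynamically import Transformers.js from a CDN inside the sketch.
  const { pipeline } = await import(
    'https://cdn.jsdelivr.net/npm/@huggingface/transformers'
  );

  // Initialize the speech-to-text pipeline before draw() begins.
  transcriber = await pipeline(
    'automatic-speech-recognition',
    'onnx-community/whisper-tiny.en', // assumed model id
    { device: 'webgpu' }
  );
}
```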
🧠 Creative Bot Intelligence
Alternatives to large language models
Demonstrates that chatbot 'brains' need not be LLMs, suggesting pattern-matching systems like ELIZA, RiveScript, or context-free grammars for simpler responses.
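To illustrate the idea, here is a toy ELIZA-style 'brain' — ordered [pattern, response] pairs where `$1` echoes the first capture group. These rules are invented for illustration, not taken from the video.

```javascript
// Ordered rules: first matching pattern wins.
const rules = [
  [/\bi feel (.*)/i, 'Why do you feel $1?'],
  [/\bmy name is (\w+)/i, 'Nice to meet you, $1!'],
  [/\b(hello|hi|hey)\b/i, 'Hello! What would you like to talk about?'],
];

function reply(input) {
  for (const [pattern, response] of rules) {
    const match = input.match(pattern);
    if (match) {
      // Substitute captured text into the response template.
      return response.replace(/\$(\d+)/g, (_, n) => match[Number(n)] ?? '');
    }
  }
  return 'Tell me more.'; // fallback when nothing matches
}
```

The transcribed text goes in, a response string comes out, and that string is handed to the TTS step — no neural network required.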
Demystifying AI through creative coding
Advocates learning AI through hands-on building with open-source models on consumer hardware, using creative play to understand and critique emerging technologies.
Push-to-talk interaction design
Implements a mouse-press interface to control when the bot listens, managing audio state between recording start and stop events to capture discrete utterances.
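The push-to-talk control flow might look like this in p5.js, assuming `mediaRecorder` is a MediaRecorder created elsewhere in the sketch:

```javascript
// Record while the mouse is held down; stop and process on release.
function mousePressed() {
  if (mediaRecorder && mediaRecorder.state === 'inactive') {
    mediaRecorder.start(); // begin capturing one utterance
  }
}

function mouseReleased() {
  if (mediaRecorder && mediaRecorder.state === 'recording') {
    mediaRecorder.stop(); // fires onstop, which hands off the recorded Blob
  }
}
```

Checking `mediaRecorder.state` guards against stray press/release events, so each press-release cycle yields exactly one discrete utterance.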
Bottom Line
You can build sophisticated voice interfaces using open-source AI models that run entirely in the browser on consumer hardware, choosing simple pattern-matching systems over massive LLMs for more creative control and privacy.
More from The Coding Train
Coding Challenge 187: Bayes Theorem
The Coding Train demonstrates how to implement a Naive Bayes text classifier in JavaScript from scratch, using a concrete library book probability example to explain Bayes Theorem before coding a lightweight, browser-based word-frequency classification system.
Coding Challenge Session: Local Browser Conversational Chatbot (STT, TTS, and more?)
Daniel Shiffman builds a local browser-based conversational chatbot using p5.js and Transformers.js, demonstrating how to run lightweight open-source AI models (Whisper for speech-to-text, Kokoro for text-to-speech) entirely in the browser without cloud dependencies.
More in Programming
IT Fundamentals Course – Hardware, Cloud, DevOps, Networking, Security, Databases, DNS, Git, Linux
This comprehensive IT fundamentals course provides a streamlined, practical alternative to traditional certification paths, covering hardware, networking, cloud computing, and DevOps through hands-on AWS practice to help beginners quickly navigate modern IT career options.
Gemini CLI Essentials – Full Course
This course prepares viewers for the Gemini CLI certification (EXP Gemini CLI01), covering Google's agentic coding tool that automates development tasks while highlighting critical limitations including restrictive token outputs and significant billing transparency issues compared to competitors like Claude Code and Codex.