Building a Sub-500ms Voice Agent: Architecture and Performance Insights

Voice Agent Architecture and Performance
Nick Tikhonov built a voice agent from scratch that averages ~400ms end-to-end latency (phone stop → first syllable). That figure covers the full STT → LLM → TTS loop, with clean barge-ins and no precomputed responses. The implementation achieved roughly half the latency of an equivalent Vapi setup.
Core Technical Insights
The key realization was that voice is a turn-taking problem, not a transcription problem. Voice Activity Detection (VAD) alone fails; semantic end-of-turn detection is required. The system reduces to one loop with two states: speaking vs listening.
The critical transitions are:
- Cancel instantly on barge-in
- Respond instantly on end-of-turn
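As a minimal sketch of that two-state loop, the Python below tracks the current turn and records which action each transition would trigger. The class, method, and action names are hypothetical and chosen only for illustration; they are not taken from Tikhonov's implementation.

```python
from enum import Enum, auto


class Turn(Enum):
    USER_SPEAKING = auto()   # the agent is listening
    USER_LISTENING = auto()  # the agent is speaking


class TurnLoop:
    """Hypothetical two-state turn tracker; names are illustrative only."""

    def __init__(self):
        self.state = Turn.USER_SPEAKING
        self.actions = []  # stand-in for real side effects

    def on_user_speech_start(self):
        # Barge-in: the user starts talking while the agent is speaking.
        if self.state is Turn.USER_LISTENING:
            self.actions.append("cancel_agent_output")
        self.state = Turn.USER_SPEAKING

    def on_end_of_turn(self):
        # End-of-turn detected: begin responding right away.
        if self.state is Turn.USER_SPEAKING:
            self.actions.append("start_response")
            self.state = Turn.USER_LISTENING


loop = TurnLoop()
loop.on_end_of_turn()        # user finished, agent responds
loop.on_user_speech_start()  # user interrupts, agent cancels
print(loop.state.name, loop.actions)
```

In a real system the two recorded actions would be wired to the cancellation and response-start paths; here they are only logged so the transitions are easy to inspect.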
Technical Requirements
STT → LLM → TTS must stream. A sequential pipeline, in which each stage waits for the previous one to finish, is too slow for natural conversation. Time To First Token (TTFT) dominates in voice interfaces because the first token sits on the critical path. Groq's ~80ms TTFT was identified as the single biggest performance win.
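To illustrate why streaming matters for the first response, the sketch below chains two stand-in generators and counts how much work happens before the first audio chunk exists. `fake_llm` and `fake_tts` are hypothetical placeholders, not real provider APIs, and the token list is made up.

```python
def fake_llm(prompt, log):
    """Stand-in for a streaming LLM: yields tokens one at a time."""
    for token in ["Sure,", " I", " can", " help."]:
        log.append(("llm", token))
        yield token


def fake_tts(tokens, log):
    """Stand-in for streaming TTS: turns each token into an audio chunk."""
    for token in tokens:
        log.append(("tts", token))
        yield f"<audio:{token.strip()}>"


def first_audio_streaming(prompt):
    log = []
    audio = fake_tts(fake_llm(prompt, log), log)
    return next(audio), log  # first chunk needs only the first token


def first_audio_sequential(prompt):
    log = []
    tokens = list(fake_llm(prompt, log))  # wait for the full completion
    audio = fake_tts(iter(tokens), log)
    return next(audio), log


first, stream_log = first_audio_streaming("hi")
_, seq_log = first_audio_sequential("hi")
print(first, len(stream_log), len(seq_log))
```

In the streamed version only one LLM step precedes the first audio chunk; in the sequential version every token must be generated first, which is why time to first token ends up on the critical path.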
Infrastructure Considerations
Geography matters more than prompts. All components must be colocated or latency becomes prohibitive before the system even starts processing. The build took approximately one day and roughly $100 in API credits.
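A rough back-of-the-envelope sketch of the colocation point: if the first response has to cross three service hops (STT, LLM, TTS), the network round trips alone add up quickly. The per-hop figures below are illustrative assumptions, not measurements from the original build; only the ~400ms total comes from the article.

```python
def network_overhead_ms(rtt_per_hop_ms, hops=3):
    """Network time added by traversing each service hop once."""
    return rtt_per_hop_ms * hops


TARGET_MS = 400  # end-to-end average reported in the article

colocated = network_overhead_ms(5)   # assumed same-region round trip
distant = network_overhead_ms(80)    # assumed cross-region round trip

print(colocated, distant, distant / TARGET_MS)
```

Under these assumed numbers, cross-region hops would consume more than half of a 400ms budget before any model does useful work.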
Why Voice Agents Are Challenging
Voice agents represent a significant complexity increase compared to text agents. The orchestration is continuous and real-time, requiring careful management of multiple models simultaneously. The system must constantly decide whether the user is speaking or listening, with transitions between these states being the most difficult aspect.
When the user starts speaking, the agent must immediately stop talking: cancel generation, cancel speech synthesis, and flush any buffered audio. When the user stops speaking, the system must confidently decide they're done and start responding with minimal delay.
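One way to sketch the barge-in path is with `asyncio` task cancellation. The `speak` coroutine and the list used as a playback buffer are hypothetical stand-ins for real generation, synthesis, and audio output; the timings are arbitrary.

```python
import asyncio


async def speak(chunks, playback_buffer):
    """Stand-in for generation + synthesis: queues audio until cancelled."""
    for chunk in chunks:
        playback_buffer.append(chunk)
        await asyncio.sleep(0.01)  # simulated per-chunk work


async def main():
    playback_buffer = []
    chunks = [f"chunk{i}" for i in range(100)]
    task = asyncio.create_task(speak(chunks, playback_buffer))

    await asyncio.sleep(0.03)  # agent talks briefly

    # Barge-in detected: stop producing audio...
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    # ...and flush whatever was already buffered for playback.
    playback_buffer.clear()
    return task.cancelled(), playback_buffer


cancelled, buffer = asyncio.run(main())
print(cancelled, buffer)
```

Cancelling the task stops new audio from being produced, while clearing the buffer prevents already-synthesized audio from playing over the user.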
Architecture Approach
The developer started by iterating on architecture with ChatGPT outside the editor to build a mental model first. The entire problem was reduced to a single loop and a tiny state machine. The core question a voice agent needs to answer is: is the user speaking, or listening?
The two states are:
- The user is speaking
- The user is listening
This turn-detection logic forms the core of every voice system. The implementation is available on GitHub for reference and further development.
📖 Read the full source: HN AI Agents