400ms Voice Agent: STT LLM TTS Streaming Architecture

Voice Agent Architecture and Performance

Nick Tikhonov built a voice agent from scratch that averages ~400ms end-to-end latency (phone stop → first syllable). That figure covers the full STT → LLM → TTS loop, with clean barge-ins and no precomputed responses. The implementation came in at roughly half the latency of Vapi's equivalent setup.

Core Technical Insights

The key realization was that voice is a turn-taking problem, not a transcription problem. Voice Activity Detection (VAD) alone fails; semantic end-of-turn detection is required. The system reduces to one loop with two states: speaking vs listening.
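To make the difference between silence-based and semantic end-of-turn detection concrete, here is a toy sketch. It is not the detector used in the build: the function name, the word list, and both thresholds are assumptions chosen for illustration, and a real system would more likely use a trained model. The point is only that the decision looks at what was said, not just at how long the silence has lasted.

```python
# Hypothetical word list: a transcript ending on one of these is
# probably an unfinished thought, even if the speaker has paused.
TRAILING_INCOMPLETE = {"and", "but", "so", "because", "um", "uh", "the", "to"}


def end_of_turn(transcript: str, silence_ms: int,
                short_pause_ms: int = 300,
                long_pause_ms: int = 1500) -> bool:
    """Decide whether the user is done, using silence AND content.

    The thresholds are illustrative assumptions, not measured values.
    """
    words = transcript.lower().rstrip(".?! ").split()
    if not words:
        return False              # nothing said yet, keep listening
    if silence_ms >= long_pause_ms:
        return True               # very long silence: stop waiting
    if silence_ms < short_pause_ms:
        return False              # too short to be a turn boundary
    # In between, let the content decide.
    return words[-1] not in TRAILING_INCOMPLETE
```

A pure VAD would treat both transcripts below identically after 500ms of silence; the sketch only responds to the second one.

```python
end_of_turn("I want to book a flight to", silence_ms=500)
end_of_turn("I want to book a flight to Paris.", silence_ms=500)
```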

The critical transitions are:

Cancel instantly on barge-in
Respond instantly on end-of-turn
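The two states and two transitions above can be sketched as a small state machine. This is a minimal illustration, not the original implementation: the enum members, event names, and action strings are all hypothetical. States are named from the user's point of view, matching the "speaking vs listening" framing.

```python
from enum import Enum, auto


class State(Enum):
    USER_SPEAKING = auto()   # user has the floor; agent stays silent
    USER_LISTENING = auto()  # agent has the floor; audio is playing


class Event(Enum):
    END_OF_TURN = auto()  # user finished their turn
    BARGE_IN = auto()     # user started talking over the agent


def next_state(state: State, event: Event) -> tuple[State, str]:
    """Return the new state and the action to run (action names are made up)."""
    if state is State.USER_SPEAKING and event is Event.END_OF_TURN:
        return State.USER_LISTENING, "start_response"
    if state is State.USER_LISTENING and event is Event.BARGE_IN:
        return State.USER_SPEAKING, "cancel_response"
    # Any other combination leaves the state unchanged.
    return state, "none"
```

Keeping the transition logic in one pure function makes the two critical paths easy to test in isolation from any audio or model code.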

Technical Requirements

STT → LLM → TTS must stream. A sequential pipeline, where each stage waits for the previous one to finish, is too slow for natural conversation. Time to first token (TTFT) dominates everything in voice interfaces because the first token sits on the critical path: nothing can be synthesized or played until it arrives. Groq's ~80ms TTFT was identified as the single biggest performance win.
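A minimal sketch of what streaming between stages means, using Python async generators. Both stages are stand-ins with simulated delays: `fake_llm`, `fake_tts`, and `respond` are hypothetical names, and no real STT, LLM, or TTS API is called. The sketch shows that the first audio chunk is available after one token rather than after the whole response.

```python
import asyncio
import time
from typing import AsyncIterator


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: yields tokens one at a time."""
    for token in ["Sure,", " I", " can", " help", " with", " that."]:
        await asyncio.sleep(0.01)  # simulated per-token delay
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for streaming TTS: emits a chunk as each token arrives."""
    async for token in tokens:
        yield token.encode("utf-8")  # placeholder for real audio bytes


async def respond(prompt: str) -> tuple[float, float, list[bytes]]:
    """Run the chained stages; return (time to first audio, total, chunks)."""
    start = time.perf_counter()
    first_audio = None
    chunks = []
    async for chunk in fake_tts(fake_llm(prompt)):
        if first_audio is None:
            first_audio = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    return first_audio, total, chunks
```

Running `asyncio.run(respond("hi"))` returns a time to first audio that is a fraction of the total, which is the quantity a low TTFT improves directly.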

Infrastructure Considerations

Geography matters more than prompts. All components must be colocated; otherwise network round trips between services make latency prohibitive before any model even starts processing. The build took approximately one day and roughly $100 in API credits.

Why Voice Agents Are Challenging

Voice agents represent a significant complexity increase compared to text agents. The orchestration is continuous and real-time, requiring careful management of multiple models simultaneously. The system must constantly decide whether the user is speaking or listening, with transitions between these states being the most difficult aspect.

When the user starts speaking, the agent must immediately stop talking - cancel generation, cancel speech synthesis, and flush any buffered audio. When the user stops speaking, the system must confidently decide they're done and start responding with minimal delay.
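The barge-in path described above (cancel generation, cancel synthesis, flush buffered audio) might look like the following with `asyncio`. This is a sketch under assumptions: the `Responder` class and its methods are hypothetical, and generation plus synthesis are collapsed into a single simulated task.

```python
import asyncio
from typing import Optional


class Responder:
    """Owns the in-flight response so it can be cancelled as one unit."""

    def __init__(self) -> None:
        self.audio_buffer = asyncio.Queue()  # audio waiting to be played
        self.task: Optional[asyncio.Task] = None

    async def _generate(self) -> None:
        # Stand-in for streaming LLM + TTS: keeps producing audio chunks.
        while True:
            await asyncio.sleep(0.01)
            self.audio_buffer.put_nowait(b"\x00" * 320)

    def start(self) -> None:
        self.task = asyncio.create_task(self._generate())

    async def barge_in(self) -> None:
        # 1) Cancel generation and synthesis (one task in this sketch).
        if self.task is not None:
            self.task.cancel()
            try:
                await self.task
            except asyncio.CancelledError:
                pass
            self.task = None
        # 2) Flush audio that was buffered but not yet played.
        while not self.audio_buffer.empty():
            self.audio_buffer.get_nowait()


async def demo() -> tuple[bool, int]:
    responder = Responder()
    responder.start()
    await asyncio.sleep(0.05)    # agent talks for a moment
    await responder.barge_in()   # user interrupts
    return responder.task is None, responder.audio_buffer.qsize()
```

Flushing the buffer matters as much as cancelling the task: without it, already-synthesized audio would keep playing after the user has started talking.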

Architecture Approach

The developer started by iterating on architecture with ChatGPT outside the editor to build a mental model first. The entire problem was reduced to a single loop and a tiny state machine. The core question a voice agent needs to answer is: is the user speaking, or listening?

The two states are: