HeadlinesBriefing.com

Building a 400ms Voice Agent: How One Developer Beat Off-the-Shelf Platforms

Hacker News

Developer Nick Tikhonov built a voice agent from scratch that achieves ~400ms end-to-end latency, outperforming commercial platforms like Vapi by 2×. The key breakthrough came from recognizing that voice is a turn-taking problem, not just a transcription challenge. By streaming speech-to-text (STT), LLM, and text-to-speech (TTS) stages in real time and colocating them geographically, he created a system that responds with near-human timing.
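To make the streaming idea concrete, here is a minimal sketch of an LLM-to-TTS hand-off in which text is forwarded as soon as a clause boundary appears, rather than after the full reply is generated. It is not Tikhonov's code: `fake_llm`, `fake_tts`, `clause_chunks`, and `respond` are hypothetical stand-ins, and chunking on punctuation is an assumption chosen for illustration.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for a streaming LLM: yields tokens one at a time.
    for token in ["Sure,", " I", " can", " help."]:
        await asyncio.sleep(0)  # give control back to the event loop
        yield token


async def clause_chunks(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    # Assumption: flush buffered text to TTS at each clause boundary,
    # so synthesis can begin before the LLM has finished generating.
    buf = ""
    async for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((",", ".", "?", "!")):
            yield buf
            buf = ""
    if buf:  # flush any trailing text without closing punctuation
        yield buf


async def fake_tts(chunks: AsyncIterator[str]) -> AsyncIterator[bytes]:
    # Hypothetical stand-in for streaming TTS: the "audio" is just encoded text.
    async for text in chunks:
        yield text.encode("utf-8")


async def respond(prompt: str) -> List[bytes]:
    # Chain the stages; each audio chunk is available as soon as its clause is.
    audio: List[bytes] = []
    async for frame in fake_tts(clause_chunks(fake_llm(prompt))):
        audio.append(frame)
    return audio


if __name__ == "__main__":
    print(asyncio.run(respond("hi")))
```

In a real pipeline each stub would wrap a provider's streaming API, and the first audio chunk could be played while later tokens are still arriving.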

Tikhonov's approach centered on a simple state machine: is the user speaking or listening? The two critical transitions, canceling instantly on barge-in and responding immediately on end-of-turn, define the conversational experience. He found that time to first token (TTFT) dominates everything in voice, with Groq's ~80ms TTFT being the single biggest performance win. The system uses Deepgram's Flux for combined transcription and turn detection, replacing basic Voice Activity Detection (VAD), which fails with natural speech patterns.
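The two-state machine described above can be sketched roughly as follows. This is an illustrative outline rather than the author's implementation: the class `TurnManager`, its event-handler names, and the callback signatures are all hypothetical, and a real system would wire them to the turn-detection events and to the LLM/TTS tasks.

```python
from enum import Enum, auto
from typing import Callable, List


class Turn(Enum):
    USER_SPEAKING = auto()   # user holds the floor; agent stays silent
    USER_LISTENING = auto()  # agent is generating or playing a response


class TurnManager:
    """Hypothetical two-state turn tracker; all names are illustrative."""

    def __init__(
        self,
        start_response: Callable[[str], None],
        cancel_response: Callable[[], None],
    ) -> None:
        self.state = Turn.USER_SPEAKING
        self._start = start_response
        self._cancel = cancel_response

    def on_end_of_turn(self, transcript: str) -> None:
        # End-of-turn: the user finished, so start responding immediately.
        if self.state is Turn.USER_SPEAKING:
            self.state = Turn.USER_LISTENING
            self._start(transcript)

    def on_user_speech_started(self) -> None:
        # Barge-in: the user spoke over the agent, so cancel right away.
        if self.state is Turn.USER_LISTENING:
            self._cancel()
            self.state = Turn.USER_SPEAKING

    def on_agent_finished(self) -> None:
        # The agent finished its reply; hand the floor back to the user.
        if self.state is Turn.USER_LISTENING:
            self.state = Turn.USER_SPEAKING


def demo() -> List[str]:
    log: List[str] = []
    tm = TurnManager(lambda t: log.append(f"respond:{t}"), lambda: log.append("cancel"))
    tm.on_end_of_turn("hello")   # agent starts answering
    tm.on_user_speech_started()  # user interrupts: barge-in
    tm.on_end_of_turn("wait")    # agent answers the new utterance
    tm.on_agent_finished()
    return log
```

Keeping the state in one place means the cancel and respond paths cannot both be active at once, which is the property the two transitions depend on.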

The project demonstrates that building the core orchestration layer yourself can yield significant performance gains over abstraction-heavy platforms. Tikhonov spent roughly $100 in API credits and completed the build in about a day, suggesting that with the right architecture and model choices, developers can create voice agents that match or exceed commercial offerings in responsiveness and natural conversation flow.