Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial
Jielin Qiu, Zixiang Chen, Liangwei Yang, Ming Zhu, Zhiwei Liu, Juntao Tan, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

TL;DR
This paper provides a comprehensive tutorial on building self-hosted enterprise-grade realtime voice agents, focusing on a cascaded pipeline approach due to current limitations of end-to-end models, and includes a complete codebase.
Contribution
It introduces a practical, step-by-step tutorial for constructing self-hosted realtime voice agents using existing components, filling a gap in accessible technical guidance.
Findings
The cascaded pipeline (STT → LLM → TTS) is the practical architecture for self-hosted realtime voice agents.
The tutorial's pipeline achieves a time-to-first-audio of approximately 730-755ms with full function calling support.
Three configurations of Qwen3-Omni are evaluated, highlighting its current limitations for self-hosted realtime deployment.
Abstract
We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While end-to-end speech-to-speech models may ultimately provide the best latency for voice agents, fully self-hosted end-to-end solutions are not yet available. We evaluate the closest candidate, Qwen3-Omni, across three configurations: its cloud-only DashScope Realtime API achieves 702ms audio-to-audio latency with streaming, but is not self-hostable; its local vLLM deployment supports only the Thinker (text generation from audio, 516ms), not the Talker (audio synthesis); and its local Transformers deployment runs the full pipeline but at 146s, far too slow for realtime. The cascaded streaming pipeline (STT → LLM → TTS) therefore remains the practical architecture for self-hosted realtime voice agents, and the focus of this tutorial. We build…
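To make the cascaded streaming idea concrete, the following is a minimal sketch and not the paper's codebase. All three stages (`fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stand-ins with artificial delays, and the sentence splitter is deliberately naive. The sketch assumes the common pattern of streaming LLM tokens, cutting them into sentences, and handing each sentence to TTS as soon as it completes, so that time-to-first-audio depends on the first sentence rather than on the full reply.

```python
import asyncio
import time
from typing import AsyncIterator, List, Tuple


async def fake_stt(audio_chunks: List[bytes]) -> str:
    """Hypothetical STT stub: pretends each audio chunk decodes to one word."""
    words = []
    for chunk in audio_chunks:
        await asyncio.sleep(0.01)  # simulated decoding delay
        words.append(chunk.decode())
    return " ".join(words)


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Hypothetical LLM stub: streams a fixed reply token by token."""
    for token in ["Sure", ",", " I", " can", " help", ".", " What", " next", "?"]:
        await asyncio.sleep(0.01)  # simulated per-token latency
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Buffer streamed tokens and emit a sentence at each '.', '?' or '!'."""
    buf = ""
    async for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "?", "!")):
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing partial sentence
        yield buf.strip()


async def fake_tts(sentence: str) -> AsyncIterator[bytes]:
    """Hypothetical TTS stub: emits one 'audio' chunk per word."""
    for word in sentence.split():
        await asyncio.sleep(0.005)  # simulated synthesis delay
        yield word.encode()


async def run_turn(audio_chunks: List[bytes]) -> Tuple[str, List[bytes], float, float]:
    """Run one STT -> LLM -> TTS turn and record time-to-first-audio."""
    start = time.perf_counter()
    transcript = await fake_stt(audio_chunks)
    first_audio_s = None
    audio_out: List[bytes] = []
    async for sentence in sentences(fake_llm(transcript)):
        async for chunk in fake_tts(sentence):
            if first_audio_s is None:
                first_audio_s = time.perf_counter() - start
            audio_out.append(chunk)
    total_s = time.perf_counter() - start
    return transcript, audio_out, first_audio_s, total_s


if __name__ == "__main__":
    transcript, audio, ttfa, total = asyncio.run(run_turn([b"book", b"a", b"room"]))
    print(f"transcript={transcript!r} chunks={len(audio)}")
    print(f"time-to-first-audio={ttfa * 1000:.0f}ms total={total * 1000:.0f}ms")
```

In this sketch the first audio chunk is produced as soon as the first sentence has been synthesized, while the second sentence has not yet been generated. The timings it prints reflect only the simulated delays and say nothing about the latencies reported in the paper.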
Taxonomy
Topics: AI in Service Interactions · Natural Language Processing Techniques · Multimodal Machine Learning Applications
