Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial

Jielin Qiu; Zixiang Chen; Liangwei Yang; Ming Zhu; Zhiwei Liu; Juntao Tan; Wenting Zhao; Rithesh Murthy; Roshan Ram; Akshara Prabhakar; Shelby Heinecke; Caiming Xiong; Silvio Savarese; Huan Wang

arXiv:2603.05413·cs.SD·March 18, 2026

Open Access

TL;DR

This paper provides a comprehensive tutorial on building self-hosted enterprise-grade realtime voice agents, focusing on a cascaded pipeline approach due to current limitations of end-to-end models, and includes a complete codebase.

Contribution

The paper introduces a practical, step-by-step tutorial for constructing self-hosted realtime voice agents from existing components, filling a gap in accessible technical guidance.

Findings

01. The cascaded pipeline (STT → LLM → TTS) is the practical architecture for self-hosted realtime voice agents.

02. The pipeline achieves a time-to-first-audio of approximately 730-755 ms with full function-calling support.

03. The paper evaluates three configurations of Qwen3-Omni and highlights their current limitations for realtime self-hosted deployment.
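To make the cascaded streaming idea and the time-to-first-audio metric concrete, here is a minimal Python sketch. It is not the paper's codebase: `stub_stt`, `stub_llm`, `stub_tts`, `sentences`, and `run_turn` are hypothetical stand-ins, and the sleeps are placeholders rather than measured latencies. The point it illustrates is that audio for the first sentence can be emitted while the LLM is still generating the rest of the reply.

```python
import asyncio
import time
from typing import AsyncIterator


async def stub_stt(audio_frames: list[bytes]) -> str:
    """Hypothetical STT stage: pretends to transcribe the buffered audio."""
    await asyncio.sleep(0.01)  # placeholder delay, not a measured latency
    return "what is my order status"


async def stub_llm(prompt: str) -> AsyncIterator[str]:
    """Hypothetical LLM stage: streams a fixed reply token by token."""
    for token in ["Your ", "order ", "shipped. ", "It ", "arrives ", "Friday."]:
        await asyncio.sleep(0.005)
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Buffer streamed tokens and emit each sentence as soon as it ends."""
    buf = ""
    async for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "?", "!")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()


async def stub_tts(sentence: str) -> AsyncIterator[bytes]:
    """Hypothetical TTS stage: returns the sentence bytes as fake audio."""
    await asyncio.sleep(0.01)
    yield sentence.encode("utf-8")


async def run_turn(audio_frames: list[bytes]):
    """Run one STT -> LLM -> TTS turn and record time-to-first-audio."""
    start = time.perf_counter()
    first_audio_s = None
    chunks: list[bytes] = []
    transcript = await stub_stt(audio_frames)
    async for sentence in sentences(stub_llm(transcript)):
        async for chunk in stub_tts(sentence):
            if first_audio_s is None:
                # First audio chunk is ready before the LLM has finished.
                first_audio_s = time.perf_counter() - start
            chunks.append(chunk)
    total_s = time.perf_counter() - start
    return chunks, first_audio_s, total_s


chunks, first_audio_s, total_s = asyncio.run(run_turn([b"\x00" * 320]))
print(f"time-to-first-audio: {first_audio_s * 1000:.0f} ms, "
      f"full turn: {total_s * 1000:.0f} ms")
```

In this sketch, time-to-first-audio is always shorter than the full turn because synthesis starts at the first sentence boundary. A real deployment would replace each stub with a streaming STT, LLM, and TTS service and would also need to handle function calls and interruptions.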

Abstract

We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While end-to-end speech-to-speech models may ultimately provide the best latency for voice agents, fully self-hosted end-to-end solutions are not yet available. We evaluate the closest candidate, Qwen3-Omni, across three configurations: its cloud-only DashScope Realtime API achieves ~702 ms audio-to-audio latency with streaming, but is not self-hostable; its local vLLM deployment supports only the Thinker (text generation from audio, 516 ms), not the Talker (audio synthesis); and its local Transformers deployment runs the full pipeline but at ~146 s, far too slow for realtime. The cascaded streaming pipeline (STT → LLM → TTS) therefore remains the practical architecture for self-hosted realtime voice agents, and the focus of this tutorial. We build…

Taxonomy

Topics: AI in Service Interactions · Natural Language Processing Techniques · Multimodal Machine Learning Applications