X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System

Zhanxun Liu; Yifan Duan; Mengmeng Wang; Pengchao Feng; Haotian Zhang; Xiaoyu Xing; Yijia Shan; Haina Zhu; Yuhang Dai; Chaochao Lu; Xipeng Qiu; Lei Xie; Lan Wang; Nan Yan; Zilong Zheng; Ziyang Ma; Kai Yu; Xie Chen

arXiv:2512.18706 · cs.SD · December 23, 2025

TL;DR

X-Talk is a modular speech-to-speech framework that combines specialized speech components with large language models. It reaches sub-second latency while staying flexible, challenging the end-to-end paradigm.

Contribution

The paper presents a decoupled, modular framework for speech-to-speech systems and argues that a systematically optimized cascaded pipeline can reach sub-second latency without giving up modular flexibility.

Findings

01 Achieves sub-second latency with modular design

02 Integrates diverse front-end and understanding components

03 Revitalizes cascaded approach for speech systems

Abstract

We present X-Talk, an open-source framework that champions a decoupled, modular design for LLM-driven speech-to-speech (S2S) systems. While the dominant trend favors end-to-end (E2E) modeling to optimize information flow, these "omni-models" often struggle to balance the competing objectives of complex speech tasks within a single network. X-Talk challenges this paradigm by demonstrating that a systematically optimized cascaded pipeline can achieve sub-second latency without sacrificing modular flexibility. Our framework seamlessly integrates specialized front-end components (e.g., VAD, speech enhancement) and diverse understanding models (e.g., ASR, emotion, and environmental sound analysis) with LLM capabilities like retrieval-augmented generation (RAG) and tool use. By revitalizing the cascaded approach, X-Talk highlights the underestimated potential of modular S2S systems and…
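To make the cascaded, decoupled design described in the abstract concrete, here is a minimal Python sketch of a pipeline whose stages (VAD, ASR, LLM, TTS) can be swapped by name. It is an illustration under stated assumptions, not X-Talk's actual API: `CascadedPipeline`, `build_pipeline`, and every `toy_*` function are hypothetical names invented for this example, and the stages are trivial stand-ins for real models.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a stage reads and updates a shared per-turn context.
Context = Dict[str, object]
Stage = Callable[[Context], Context]


class CascadedPipeline:
    """Runs named stages in order; any stage can be replaced by name."""

    def __init__(self) -> None:
        self._stages: List[Tuple[str, Stage]] = []

    def add(self, name: str, stage: Stage) -> "CascadedPipeline":
        self._stages.append((name, stage))
        return self

    def replace(self, name: str, stage: Stage) -> None:
        # Swap one component without touching the rest of the pipeline.
        for i, (existing, _) in enumerate(self._stages):
            if existing == name:
                self._stages[i] = (name, stage)
                return
        raise KeyError(name)

    def run(self, ctx: Context) -> Context:
        for _, stage in self._stages:
            ctx = stage(ctx)
            if ctx.get("stop"):  # a stage may end the turn early
                break
        return ctx


def toy_vad(ctx: Context) -> Context:
    # Stand-in VAD: any sample above a fixed amplitude counts as speech.
    ctx["has_speech"] = any(abs(s) > 0.1 for s in ctx["audio"])
    ctx["stop"] = not ctx["has_speech"]
    return ctx


def toy_asr(ctx: Context) -> Context:
    ctx["transcript"] = "hello"  # placeholder for a real recognizer
    return ctx


def toy_llm(ctx: Context) -> Context:
    ctx["reply_text"] = "You said: " + str(ctx["transcript"])
    return ctx


def toy_tts(ctx: Context) -> Context:
    # Placeholder synthesis: one silent sample per output character.
    ctx["reply_audio"] = [0.0] * len(str(ctx["reply_text"]))
    return ctx


def build_pipeline() -> CascadedPipeline:
    return (
        CascadedPipeline()
        .add("vad", toy_vad)
        .add("asr", toy_asr)
        .add("llm", toy_llm)
        .add("tts", toy_tts)
    )


if __name__ == "__main__":
    result = build_pipeline().run({"audio": [0.0, 0.5, -0.3]})
    print(result["reply_text"])
```

The point of the sketch is only the shape of the design: each component sits behind the same small interface, so a different recognizer or language model can be dropped in with `replace` while the other stages stay unchanged. A real system would also stream audio between stages to keep latency low, which this sketch does not attempt.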

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Emotion and Mood Recognition