SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation

Ruiqi Yan; Wenxi Chen; Zhanxun Liu; Ziyang Ma; Haopeng Lin; Hanlin Wen; Hanke Xie; Jun Wu; Yuzhe Liang; Yuxiang Zhao; Pengchao Feng; Jiale Qian; Hao Meng; Yuhang Dai; Shunshun Yin; Ming Tao; Lei Xie; Kai Yu; Xinsheng Wang; Xie Chen

arXiv:2603.14877·eess.AS·March 17, 2026

TL;DR

SoulX-Duplug is a plug-and-play streaming state prediction module for full-duplex spoken dialogue systems. It performs streaming ASR jointly with state prediction, so it can use textual information to identify user intent and serve as a semantic VAD with low-latency dialogue state control.

Contribution

The paper introduces a plug-and-play streaming state prediction module that explicitly uses textual information, obtained through joint streaming ASR, to identify user intent. It also introduces SoulX-Duplug-Eval, which extends widely used benchmarks with improved bilingual coverage.

Findings

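01. The system built on SoulX-Duplug outperforms existing full-duplex models in turn management.

02. SoulX-Duplug enables low-latency streaming dialogue state control.

03. SoulX-Duplug-Eval extends widely used benchmarks with improved bilingual coverage.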

Abstract

Recent advances in spoken dialogue systems have brought increased attention to human-like full-duplex voice interactions. However, our comprehensive review of this field reveals several challenges, including the difficulty in obtaining training data, catastrophic forgetting, and limited scalability. In this work, we propose SoulX-Duplug, a plug-and-play streaming state prediction module for full-duplex spoken dialogue systems. By jointly performing streaming ASR, SoulX-Duplug explicitly leverages textual information to identify user intent, effectively serving as a semantic VAD. To promote fair evaluation, we introduce SoulX-Duplug-Eval, extending widely used benchmarks with improved bilingual coverage. Experimental results show that SoulX-Duplug enables low-latency streaming dialogue state control, and the system built upon it outperforms existing full-duplex models in overall turn…
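To make the "semantic VAD" idea concrete, here is a minimal sketch of a streaming state predictor that combines an acoustic speech flag with the partial transcript before declaring the end of a turn. It is an illustration under stated assumptions, not the SoulX-Duplug implementation: the state labels, the `Chunk` type, the `looks_complete` heuristic, and the `predict_states` function are all hypothetical, and the real module uses a learned model rather than a word list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator


class DialogueState(Enum):
    # Hypothetical labels; the paper's actual state inventory is not given here.
    IDLE = "idle"            # no user speech heard yet
    LISTENING = "listening"  # user is mid-utterance (or pausing mid-thought)
    TURN_END = "turn_end"    # user appears to have finished the turn


@dataclass
class Chunk:
    has_speech: bool   # stand-in for an acoustic speech/non-speech decision
    partial_text: str  # stand-in for the streaming ASR hypothesis so far


def looks_complete(text: str) -> bool:
    """Toy textual completeness check (an assumption, not the paper's model)."""
    words = text.strip().lower().rstrip(".?!").split()
    dangling = {"and", "but", "so", "um", "uh", "the", "to"}
    return bool(words) and words[-1] not in dangling


def predict_states(chunks: Iterable[Chunk]) -> Iterator[DialogueState]:
    """Emit one state per incoming chunk, using text to disambiguate pauses."""
    heard_speech = False
    for chunk in chunks:
        if chunk.has_speech:
            heard_speech = True
            yield DialogueState.LISTENING
        elif heard_speech and looks_complete(chunk.partial_text):
            # Silence plus a complete-looking transcript: treat as end of turn.
            heard_speech = False
            yield DialogueState.TURN_END
        elif heard_speech:
            # Silence, but the transcript trails off: keep listening.
            yield DialogueState.LISTENING
        else:
            yield DialogueState.IDLE


stream = [
    Chunk(False, ""),
    Chunk(True, "book a table"),
    Chunk(False, "book a table and"),        # pause after a dangling "and"
    Chunk(True, "book a table and a taxi"),
    Chunk(False, "book a table and a taxi"),  # pause after a complete request
]
states = list(predict_states(stream))
```

In this toy run the pause after "and" does not end the turn, while the pause after "a taxi" does, which is the behavior a purely acoustic VAD cannot provide. A downstream dialogue system could consume the emitted states without being retrained, which is the sense of "plug-and-play" assumed here.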

Code & Models

Models

Soul-AILab/SoulX-Duplug-0.6B
model · 115 downloads · 12 likes

Datasets

Soul-AILab/SoulX-Duplug-Eval
dataset · 409 downloads

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Emotion and Mood Recognition