DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities
Xiangyu Lu, Wang Xu, Haoyu Wang, Hongyun Zhou, Haiyan Zhao, Conghui Zhu, Tiejun Zhao, Muyun Yang

TL;DR
DuplexMamba is a novel multimodal duplex model that enables real-time, streaming speech-to-text conversations with simultaneous input processing and output generation, improving efficiency for human-machine interactions.
Contribution
It introduces a Mamba-based end-to-end model with a duplex decoding strategy for real-time speech conversation, supporting streaming while performing comparably to Transformer-based models.
Findings
Achieves real-time duplex speech processing with performance comparable to Transformer models.
Supports dynamic streaming input and output for natural conversations.
Demonstrates effectiveness on speech recognition and voice assistant benchmarks.
Abstract
Real-time speech conversation is essential for natural and efficient human-machine interactions, requiring duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner and exhibit quadratic computational complexity that grows as the input size increases. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba enables simultaneous input processing and output generation, dynamically adjusting to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it with a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that enables DuplexMamba to process input and generate output simultaneously. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming…
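To illustrate why a recurrent state-space formulation suits streaming, the sketch below runs a scalar linear recurrence over an input signal, once in a single pass and once in arbitrary chunks. It is a simplified stand-in, not DuplexMamba's encoder or decoding strategy: Mamba uses input-dependent (selective) parameters and vector-valued states, whereas the fixed coefficients `a`, `b`, `c` and the function names `ssm_step` and `run_stream` here are hypothetical and chosen only for illustration. The point is that each step costs constant work and only a fixed-size state is carried forward, so chunked input produces the same output as the full sequence.

```python
from typing import Iterable, List, Tuple


def ssm_step(state: float, x: float,
             a: float = 0.9, b: float = 0.1, c: float = 1.0) -> Tuple[float, float]:
    """One step of a scalar linear recurrence (illustrative coefficients).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    new_state = a * state + b * x
    return new_state, c * new_state


def run_stream(chunks: Iterable[List[float]], state: float = 0.0) -> List[float]:
    """Consume input chunk by chunk, carrying only the scalar state between chunks."""
    outputs: List[float] = []
    for chunk in chunks:
        for x in chunk:
            state, y = ssm_step(state, x)
            outputs.append(y)
    return outputs


signal = [1.0, 0.0, 2.0, -1.0, 0.5, 3.0]

# Whole sequence at once versus three streamed chunks of uneven size.
full = run_stream([signal])
streamed = run_stream([signal[:2], signal[2:5], signal[5:]])

# Same operations in the same order, so the results match exactly.
assert full == streamed
```

Total work grows linearly with the number of input samples, in contrast to self-attention, where each new token attends to all previous ones.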
Taxonomy
Topics: Speech and Audio Processing · Speech and dialogue systems · Speech Recognition and Synthesis
