Language Model Can Listen While Speaking
Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

TL;DR
This paper introduces a listening-while-speaking language model (LSLM) that enables real-time, duplex speech interaction, allowing for interruptions and more natural human-computer conversations.
Contribution
The paper presents a novel end-to-end model with listening and speaking channels, employing fusion strategies to improve real-time interaction and interruption handling in speech dialogue systems.
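To make the idea of a speaking loop that also listens more concrete, here is a toy sketch. It is not the paper's model: the `STOP` marker, the `duplex_generate` and `toy_policy` names, and the energy threshold are all hypothetical stand-ins chosen for illustration. The only point it shows is the control flow, where each generation step conditions on both the tokens spoken so far and the current frame from the listening channel, and generation halts as soon as the step decides to yield the turn.

```python
from typing import Callable, List, Sequence

# Hypothetical marker meaning "the speaker yields the turn"; not taken from the paper.
STOP = "<stop>"


def duplex_generate(
    next_token: Callable[[List[str], Sequence[float]], str],
    listen_frames: Sequence[Sequence[float]],
    max_steps: int,
) -> List[str]:
    """Emit tokens step by step while reading one listening frame per step."""
    spoken: List[str] = []
    for step in range(min(max_steps, len(listen_frames))):
        # Each step sees both channels: what was spoken and what is being heard.
        token = next_token(spoken, listen_frames[step])
        if token == STOP:
            break  # interrupted: stop speaking immediately
        spoken.append(token)
    return spoken


def toy_policy(spoken: List[str], frame: Sequence[float], threshold: float = 0.5) -> str:
    """Stand-in for a learned model: yield the turn when the heard frame is loud."""
    energy = sum(x * x for x in frame) / max(len(frame), 1)
    if energy > threshold:
        return STOP
    return f"tok{len(spoken)}"


# Two quiet frames, then a loud one that acts as an interruption.
frames = [[0.0, 0.1], [0.1, 0.0], [1.0, 1.0], [0.0, 0.0]]
result = duplex_generate(toy_policy, frames, max_steps=10)
print(result)  # ['tok0', 'tok1']
```

In a real system the decision to stop would be learned jointly with speech generation rather than derived from a fixed energy threshold; the threshold here only keeps the example deterministic.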
Findings
Middle fusion achieves the best balance between speech generation and real-time interaction.
LSLM is robust to noise and to diverse instructions.
LSLM enables duplex communication with minimal impact on existing systems.
Abstract
Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming…
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
