Language Model Can Listen While Speaking

Ziyang Ma; Yakun Song; Chenpeng Du; Jian Cong; Zhuo Chen; Yuping Wang; Yuxuan Wang; Xie Chen

arXiv:2408.02622 · cs.CL · August 6, 2024

TL;DR

This paper introduces a listening-while-speaking language model (LSLM) that enables real-time, duplex speech interaction, allowing for interruptions and more natural human-computer conversations.

Contribution

The paper presents a novel end-to-end model with listening and speaking channels, employing fusion strategies to improve real-time interaction and interruption handling in speech dialogue systems.
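The idea of speaking while listening can be illustrated with a minimal sketch. It assumes a model that, at each generation step, sees both the tokens spoken so far and the current frame from the listening channel, and that signals an interruption by emitting a special token. The names `IRQ`, `speak_while_listening`, and `toy_model` are hypothetical and chosen for illustration; they are not taken from the paper's code.

```python
from typing import Callable, Iterable, List

IRQ = "<IRQ>"  # hypothetical interruption token; the name is an assumption


def speak_while_listening(
    next_token: Callable[[List[str], str], str],
    listening_frames: Iterable[str],
    max_steps: int = 50,
) -> List[str]:
    """Generate speech tokens step by step, reading one listening frame per step.

    Generation stops early when the model emits IRQ, meaning it has decided
    that the user is interrupting.
    """
    spoken: List[str] = []
    for step, frame in enumerate(listening_frames):
        if step >= max_steps:
            break
        token = next_token(spoken, frame)
        if token == IRQ:
            break  # yield the turn instead of finishing the utterance
        spoken.append(token)
    return spoken


def toy_model(spoken: List[str], frame: str) -> str:
    # Stand-in for a real model: interrupts as soon as the frame holds user speech.
    return IRQ if frame == "user_speech" else f"tok{len(spoken)}"


frames = ["silence", "silence", "noise", "user_speech", "silence"]
result = speak_while_listening(toy_model, frames)
print(result)  # ['tok0', 'tok1', 'tok2']
```

In this toy run the model keeps speaking through silence and background noise and stops only when a frame contains user speech, which is the behavior a turn-based system cannot express.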

Findings

1. Middle fusion achieves the best balance between speech generation and interaction.

2. LSLM is robust to noise and to diverse instructions.

3. LSLM enables duplex communication with minimal impact on existing systems.
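This summary does not define the fusion strategies, so the sketch below relies on the conventional meaning of the terms in multi-channel models: early fusion merges the two channels once at the input, middle fusion merges them at every block, and late fusion merges them at the output. This reading is an assumption, and the toy layers, the additive merge, and all function names are hypothetical rather than the authors' implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used here as the (assumed) merge operation."""
    return [x + y for x, y in zip(a, b)]


def run(layers: Sequence[Layer], x: Vector) -> Vector:
    """Apply each layer in order."""
    for layer in layers:
        x = layer(x)
    return x


def early_fusion(layers: Sequence[Layer], speak: Vector, listen: Vector) -> Vector:
    # Merge once at the input, then process the fused signal.
    return run(layers, add(speak, listen))


def middle_fusion(layers: Sequence[Layer], speak: Vector, listen: Vector) -> Vector:
    # Re-inject the listening features before every block.
    h = speak
    for layer in layers:
        h = layer(add(h, listen))
    return h


def late_fusion(layers: Sequence[Layer], speak: Vector, listen: Vector) -> Vector:
    # Process the speaking channel alone, then merge at the output.
    return add(run(layers, speak), listen)


# Toy nonlinear block standing in for a Transformer layer.
block: Layer = lambda v: [max(0.0, 2.0 * x - 1.0) for x in v]
layers = [block, block]
speak, listen = [1.0], [0.5]

early = early_fusion(layers, speak, listen)
middle = middle_fusion(layers, speak, listen)
late = late_fusion(layers, speak, listen)
print(early, middle, late)  # [3.0] [4.0] [1.5]
```

The three variants differ only in where the listening signal enters the computation, which is why they can produce different outputs from identical inputs.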

Abstract

Dialogue serves as the most natural manner of human-computer interaction (HCI). Recent advancements in speech language models (SLM) have significantly enhanced speech-based conversational AI. However, these models are limited to turn-based conversation, lacking the ability to interact with humans in real-time spoken scenarios, for example, being interrupted when the generated content is not satisfactory. To address these limitations, we explore full duplex modeling (FDM) in interactive speech language models (iSLM), focusing on enhancing real-time interaction and, more explicitly, exploring the quintessential ability of interruption. We introduce a novel model design, namely listening-while-speaking language model (LSLM), an end-to-end system equipped with both listening and speaking channels. Our LSLM employs a token-based decoder-only TTS for speech generation and a streaming…

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques