A Full-duplex Speech Dialogue Scheme Based On Large Language Models
Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, Yuanjun Xiong

TL;DR
This paper introduces a full-duplex speech dialogue system using large language models that can listen and speak simultaneously, significantly reducing response latency and improving interruption accuracy in real-time interactions.
Contribution
The paper presents a novel full-duplex dialogue system based on an LLM with a neural FSM, enabling simultaneous listening and speaking with reduced latency and higher interruption precision.
Findings
Response latency is reduced more than threefold compared to half-duplex systems.
System responds within 500 ms in over 50% of interactions.
Achieves 8% higher interruption precision with an 8-billion-parameter LLM.
Abstract
We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules operate in tandem, allowing the system to speak and listen to the user simultaneously. The LLM generates textual tokens for inquiry responses and makes autonomous decisions to start responding to, wait for, or interrupt the user by emitting control tokens to the neural FSM. All these tasks of the LLM are carried out as next token prediction on a serialized view of the dialogue in real-time. In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response…
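The two-state control loop described in the abstract can be illustrated with a minimal sketch. It assumes the LLM's output arrives as a flat stream of tokens in which two control tokens switch the state and all other tokens are response text. The token names `<speak>` and `<listen>`, the class `NeuralFSMSketch`, and the helper `run` are hypothetical and chosen for illustration only; they are not taken from the paper, and the real system makes these decisions with an aligned LLM over a serialized dialogue rather than a hand-written rule.

```python
from enum import Enum


class State(Enum):
    LISTEN = "listen"
    SPEAK = "speak"


# Hypothetical control-token names (assumption, not the paper's vocabulary).
TOK_SPEAK = "<speak>"
TOK_LISTEN = "<listen>"


class NeuralFSMSketch:
    """Toy two-state machine driven by control tokens in a token stream."""

    def __init__(self):
        self.state = State.LISTEN  # assume the system starts by listening
        self.spoken = []           # text tokens that would go to speech output

    def step(self, token):
        if token == TOK_SPEAK:
            # Start responding to the user, or interrupt them mid-utterance.
            self.state = State.SPEAK
        elif token == TOK_LISTEN:
            # Stop speaking (or keep waiting) and yield the floor to the user.
            self.state = State.LISTEN
        elif self.state is State.SPEAK:
            # Ordinary text token: forwarded to speech output while speaking.
            self.spoken.append(token)
        # Text tokens that arrive in the LISTEN state are dropped in this sketch.
        return self.state


def run(tokens):
    """Feed a token stream through the sketch; return final state and output."""
    fsm = NeuralFSMSketch()
    for tok in tokens:
        fsm.step(tok)
    return fsm.state, fsm.spoken
```

For example, `run(["<speak>", "Hello", "there", "<listen>"])` ends in the `LISTEN` state with `["Hello", "there"]` collected for speech output. Emitting `<listen>` while already listening corresponds to the "wait" decision: the state does not change and nothing is spoken.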
Taxonomy
Topics: Speech and dialogue systems · Robotics and Automated Systems · Speech Recognition and Synthesis
Methods: Attentive Walk-Aggregating Graph Neural Network
