UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
Yadong Li, Guoxin Wu, Haiping Hou, Biye Li

TL;DR
This paper introduces UAF, a unified audio front-end large language model designed for full-duplex speech systems, improving responsiveness and accuracy by integrating multiple audio tasks into a single autoregressive model.
Contribution
UAF is presented as the first model to unify diverse audio front-end tasks in one autoregressive LLM tailored for full-duplex speech interaction, targeting the latency and error propagation of cascaded pipelines.
Findings
Achieves leading performance across multiple audio front-end tasks.
Significantly improves response latency in real-world scenarios.
Enhances interruption accuracy in full-duplex speech systems.
Abstract
Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation tasks. However, most of these models are inherently half-duplex and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In our development of a speech assistant, we observed that optimizing the speech front-end is as crucial as advancing the back-end unified model for achieving seamless,…
