Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

Tianpeng Li; Jun Liu; Tao Zhang; Yuanbo Fang; Da Pan; Mingrui Wang; Zheng Liang; Zehuan Li; Mingan Lin; Guosheng Dong; Jianhua Xu; Haoze Sun; Zenan Zhou; Weipeng Chen

arXiv:2502.17239 · cs.CL · February 25, 2025

TL;DR

Baichuan-Audio is an end-to-end speech interaction model that combines audio understanding and generation, using multi-codebook speech discretization and a two-stage pre-training strategy to support real-time spoken conversation.

Contribution

The paper introduces a unified framework with a text-guided aligned speech generation mechanism and a two-stage pre-training strategy that preserves the LLM's language understanding while improving audio modeling.

Findings

01. Superior performance in real-time spoken dialogue
02. Effective question-answering capabilities
03. Maintains language understanding during audio modeling

Abstract

We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following…
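To make the 12.5 Hz frame rate concrete, the arithmetic can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the number of codebooks is left as a parameter because the abstract does not state it. It assumes one token per codebook per frame, which is the usual reading of a multi-codebook setup.

```python
import math

# Frame rate stated in the abstract.
FRAME_RATE_HZ = 12.5


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Milliseconds of audio covered by one discretized frame."""
    return 1000.0 / frame_rate_hz


def num_frames(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Frames needed to cover a clip, rounding a partial frame up."""
    return math.ceil(duration_s * frame_rate_hz)


def num_audio_tokens(
    duration_s: float,
    num_codebooks: int,  # hypothetical parameter; not given in the abstract
    frame_rate_hz: float = FRAME_RATE_HZ,
) -> int:
    """Total discrete tokens, assuming one token per codebook per frame."""
    return num_frames(duration_s, frame_rate_hz) * num_codebooks


if __name__ == "__main__":
    print(frame_duration_ms())         # duration of one frame in ms
    print(num_frames(10.0))            # frames in a 10-second clip
    print(num_audio_tokens(10.0, 4))   # tokens with an illustrative 4 codebooks
```

At this rate each frame spans 80 ms, so a 10-second utterance yields 125 frames; the total token count then scales linearly with however many codebooks the tokenizer uses.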

Code & Models

Repositories

baichuan-inc/baichuan-audio
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems