Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen

TL;DR
Baichuan-Audio is a comprehensive end-to-end speech interaction model that combines audio understanding and generation, utilizing a novel multi-codebook discretization and a two-stage pre-training strategy for real-time speech-based conversations.
Contribution
It introduces a unified framework with a text-guided speech generation mechanism and a two-stage pre-training approach to enhance audio and language understanding capabilities.
Findings
Superior performance in real-time spoken dialogue
Effective question-answering capabilities
Maintains language understanding during audio modeling
Abstract
We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following…
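To make the 12.5 Hz multi-codebook tokenization concrete, here is a minimal sketch of what such a token grid could look like. The abstract does not say how the codebooks are built, so the residual vector quantization used here is an assumption, not the authors' confirmed method. The names `num_frames`, `nearest`, and `residual_quantize`, the sizes `DIM`, `N_CODEBOOKS`, and `CODEBOOK_SIZE`, and the random vectors standing in for ASR encoder features are all hypothetical.

```python
import random

FRAME_RATE_HZ = 12.5  # frame rate stated in the abstract


def num_frames(duration_s, frame_rate_hz=FRAME_RATE_HZ):
    """Number of token frames for a clip, dropping any partial frame."""
    return int(duration_s * frame_rate_hz)


def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def residual_quantize(frames, codebooks):
    """Assign one index per codebook to every frame.

    Assumed scheme: each codebook quantizes the residual left over by the
    previous one, so later codebooks encode progressively finer detail.
    """
    tokens = []
    for frame in frames:
        residual = list(frame)
        ids = []
        for codebook in codebooks:
            k = nearest(residual, codebook)
            ids.append(k)
            residual = [r - c for r, c in zip(residual, codebook[k])]
        tokens.append(ids)
    return tokens


# Hypothetical sizes, chosen only for illustration.
DIM, N_CODEBOOKS, CODEBOOK_SIZE = 8, 4, 16
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for _ in range(N_CODEBOOKS)
]
# Random vectors stand in for the features of a 10-second clip.
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(num_frames(10.0))]

tokens = residual_quantize(frames, codebooks)
print(len(tokens), len(tokens[0]))  # prints "125 4"
```

At 12.5 Hz, a 10-second clip yields 125 frames, and each frame carries one token per codebook. The low frame rate keeps the sequence short, and the several codebooks per frame are how the speech tokens can retain both semantic and acoustic information.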
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
