UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
Dongchao Yang, Yuanyuan Wang, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng

TL;DR
UniAudio 2.0 introduces a unified audio language model with a novel reasoning-based audio tokenizer and an autoregressive architecture, enabling effective understanding, generation, and zero-shot generalization across diverse audio tasks.
Contribution
It proposes ReasoningCodec, a discrete audio tokenizer that factorizes audio into text-aligned reasoning tokens for high-level understanding and reconstruction tokens for high-fidelity waveform reconstruction, and a unified autoregressive model trained on large-scale text and audio data.
Findings
Achieves understanding performance comparable to strong continuous representations.
Improves audio generation quality and reconstruction fidelity.
Demonstrates strong few-shot and zero-shot generalization.
Abstract
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior…
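The factorization described above can be pictured with a small data-structure sketch. This is an illustration only, not the authors' implementation: the class `FactorizedAudioTokens`, the marker ids, the vocabulary size, and the choice to place reasoning tokens before reconstruction tokens in one flat sequence are all assumptions made here (the ordering is one possible reading of "hierarchical generation").

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FactorizedAudioTokens:
    """Hypothetical container for the two token streams named in the abstract."""
    reasoning: List[int]        # text-aligned, high-level tokens
    reconstruction: List[int]   # acoustic tokens used for waveform reconstruction


# Hypothetical marker ids; the real model's vocabulary layout is not given here.
BOS_REASONING = 0
BOS_RECONSTRUCTION = 1
NUM_SPECIAL = 2


def to_sequence(tokens: FactorizedAudioTokens, reasoning_vocab: int) -> List[int]:
    """Flatten both streams into one sequence for an autoregressive model.

    Assumed layout: reasoning tokens first, then reconstruction tokens.
    Ids are offset so markers and the two streams occupy disjoint ranges.
    """
    r_off = NUM_SPECIAL
    a_off = NUM_SPECIAL + reasoning_vocab
    return (
        [BOS_REASONING]
        + [t + r_off for t in tokens.reasoning]
        + [BOS_RECONSTRUCTION]
        + [t + a_off for t in tokens.reconstruction]
    )


def from_sequence(seq: List[int], reasoning_vocab: int) -> FactorizedAudioTokens:
    """Invert to_sequence by splitting at the reconstruction marker."""
    if not seq or seq[0] != BOS_REASONING:
        raise ValueError("sequence must start with the reasoning marker")
    split = seq.index(BOS_RECONSTRUCTION)  # marker id never collides with offset ids
    r_off = NUM_SPECIAL
    a_off = NUM_SPECIAL + reasoning_vocab
    return FactorizedAudioTokens(
        reasoning=[t - r_off for t in seq[1:split]],
        reconstruction=[t - a_off for t in seq[split + 1:]],
    )


example = FactorizedAudioTokens(reasoning=[3, 1, 4], reconstruction=[10, 20])
flat = to_sequence(example, reasoning_vocab=8)
restored = from_sequence(flat, reasoning_vocab=8)
```

Under this assumed layout, a model that predicts the flat sequence left to right would emit the high-level stream before the acoustic one, and the round trip through `from_sequence` recovers both streams unchanged.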
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
