UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

Dongchao Yang; Yuanyuan Wang; Dading Chong; Songxiang Liu; Xixin Wu; Helen Meng

arXiv:2602.04683·cs.SD·February 12, 2026

Open Access

TL;DR

UniAudio 2.0 is a unified audio language model that pairs ReasoningCodec, a discrete tokenizer that factorizes audio into text-aligned reasoning tokens and reconstruction tokens, with an autoregressive architecture, targeting audio understanding, generation, and few-shot and zero-shot generalization across diverse audio tasks.

Contribution

It proposes ReasoningCodec, a discrete audio codec whose reasoning tokens carry text-aligned, high-level representations and whose reconstruction tokens support high-fidelity waveform reconstruction, together with a unified autoregressive model trained on large-scale text and audio data.

Findings

01. ReasoningCodec achieves understanding performance comparable to strong continuous representations.

02. Improves audio generation quality and reconstruction fidelity.

03. Demonstrates strong few-shot and zero-shot generalization.

Abstract

We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior…
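The factorization described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the vocabulary sizes, the ID offsets, the `FactorizedAudioTokens` container, and the ordering (text, then reasoning tokens, then reconstruction tokens) are all assumptions made for illustration. It only shows one plausible way two token streams could share a single ID space so that an autoregressive model sees one flat sequence.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All sizes and names below are hypothetical, chosen only for illustration.
TEXT_VOCAB = 1000        # assumed text vocabulary size
REASONING_VOCAB = 256    # assumed reasoning-token codebook size
RECON_VOCAB = 1024       # assumed reconstruction-token codebook size
REASONING_OFFSET = TEXT_VOCAB
RECON_OFFSET = REASONING_OFFSET + REASONING_VOCAB


@dataclass
class FactorizedAudioTokens:
    """Hypothetical container for the two token streams of one audio clip."""
    reasoning: List[int]        # high-level, text-aligned tokens
    reconstruction: List[int]   # acoustic tokens used to rebuild the waveform


def build_sequence(text_ids: List[int], audio: FactorizedAudioTokens) -> List[int]:
    """Map text and both audio streams into one shared ID space.

    Reasoning tokens are placed before reconstruction tokens, so a
    left-to-right model would emit the high-level plan first (an assumed
    reading of "hierarchical generation").
    """
    if any(not 0 <= t < TEXT_VOCAB for t in text_ids):
        raise ValueError("text id out of range")
    if any(not 0 <= r < REASONING_VOCAB for r in audio.reasoning):
        raise ValueError("reasoning id out of range")
    if any(not 0 <= c < RECON_VOCAB for c in audio.reconstruction):
        raise ValueError("reconstruction id out of range")
    return (
        list(text_ids)
        + [REASONING_OFFSET + r for r in audio.reasoning]
        + [RECON_OFFSET + c for c in audio.reconstruction]
    )


def split_sequence(seq: List[int]) -> Tuple[List[int], FactorizedAudioTokens]:
    """Invert build_sequence by routing each ID back to its stream."""
    text: List[int] = []
    reasoning: List[int] = []
    recon: List[int] = []
    for tok in seq:
        if tok < REASONING_OFFSET:
            text.append(tok)
        elif tok < RECON_OFFSET:
            reasoning.append(tok - REASONING_OFFSET)
        else:
            recon.append(tok - RECON_OFFSET)
    return text, FactorizedAudioTokens(reasoning, recon)
```

For example, `build_sequence([1, 2], FactorizedAudioTokens([0, 5], [3]))` yields a flat list in which the text IDs are unchanged and each audio stream is shifted into its own ID range; `split_sequence` recovers the original streams.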

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing