UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens

Chengwei Liu; Haoyin Yan; Shaofei Xue; Xiaotao Liang; Yinghao Liu; Zheng Xue; Gang Song; Boyang Zhou

arXiv:2510.26372·cs.SD·October 31, 2025

TL;DR

UniTok-Audio introduces a unified, scalable framework for diverse audio generation tasks using generative modeling on discrete codec tokens, improving quality and generalization across multiple applications.

Contribution

It proposes a novel unified framework that leverages discrete tokens, task identifiers, and dual-stream codecs for versatile and high-quality audio generation.

Findings

01 Achieves competitive performance across five audio tasks.

02 Demonstrates high-fidelity waveform reconstruction.

03 Unifies multiple audio tasks in a single model.

Abstract

Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in terms of audio quality and generalization ability across tasks. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose UniTok-Audio, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok-Audio extracts continuous features of conditions to generate discrete tokens of target audio in an autoregressive manner; 2) a special task identifier token unifies different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec involving acoustic and semantic branches is developed for high-fidelity…
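The first two points of the abstract (conditioning on continuous features, and a task identifier token that selects the learning pattern) can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the task names, `CODEBOOK_SIZE`, `EOS_ID`, the greedy decoding rule, and the `toy_scores` function standing in for a trained decoder are all hypothetical.

```python
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 8                          # assumed codec vocabulary size (toy value)
EOS_ID = CODEBOOK_SIZE                     # hypothetical end-of-sequence id
TASK_TOKENS = {"task_a": 0, "task_b": 1}   # placeholder task names, not from the paper


def toy_scores(task_token: int, cond: Sequence[float], prefix: Sequence[int]) -> List[float]:
    """Stand-in for a trained decoder: one score per codec token id, plus EOS."""
    # The preferred token depends on the task token, the condition features,
    # and how many tokens have been generated so far.
    target = (task_token + int(sum(cond)) + len(prefix)) % CODEBOOK_SIZE
    scores = [-float(abs(k - target)) for k in range(CODEBOOK_SIZE)]
    # Favor EOS once four tokens exist, so the toy loop terminates.
    scores.append(1.0 if len(prefix) >= 4 else -100.0)
    return scores


def generate(
    task: str,
    cond: Sequence[float],
    score_fn: Callable[[int, Sequence[float], Sequence[int]], List[float]] = toy_scores,
    max_len: int = 16,
) -> List[int]:
    """Greedy autoregressive decoding of discrete tokens for one task."""
    task_token = TASK_TOKENS[task]
    tokens: List[int] = []
    for _ in range(max_len):
        scores = score_fn(task_token, cond, tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == EOS_ID:
            break
        tokens.append(next_id)
    return tokens


print(generate("task_a", [1.0, 2.0]))  # [3, 4, 5, 6]
print(generate("task_b", [1.0, 2.0]))  # [4, 5, 6, 7]
```

With the same condition features, changing only the task token changes the generated token sequence, which is the role the abstract assigns to the task identifier. In a real system the resulting token ids would be passed to the codec decoder to reconstruct a waveform; that step is omitted here.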
