TL;DR
UniTok-Audio introduces a unified, scalable framework for diverse audio generation tasks using generative modeling on discrete codec tokens, improving quality and generalization across multiple applications.
Contribution
It proposes a novel unified framework that leverages discrete tokens, task identifiers, and dual-stream codecs for versatile and high-quality audio generation.
Findings
Achieves competitive performance across five audio tasks.
Demonstrates high-fidelity waveform reconstruction.
Unifies multiple audio tasks in a single model.
Abstract
Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in terms of audio quality and generalization ability across tasks. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose UniTok-Audio, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok-Audio extracts continuous features of conditions to generate discrete tokens of target audio in an autoregressive manner; 2) a special task identifier token unifies different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec involving acoustic and semantic branches is developed for high-fidelity…
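Points 1) and 2) of the abstract can be illustrated with a toy sketch: a task identifier and continuous condition features steer a loop that emits discrete tokens one at a time. This is not the paper's implementation. The task names, codebook size, end-of-sequence id, and the arithmetic scoring rule standing in for a trained decoder are all hypothetical and chosen only so the example runs.

```python
# Toy sketch only: every name, size, and the scoring rule below is an assumption
# made for illustration, not UniTok-Audio's actual model or codec.

TASK_IDS = {"task_a": 0, "task_b": 1}   # hypothetical task identifier tokens
CODEBOOK_SIZE = 8                        # toy number of discrete codec tokens
EOS = CODEBOOK_SIZE                      # extra id that ends generation


def toy_scores(task_id, condition, generated):
    """Stand-in for a trained decoder: one integer score per candidate token.

    The score depends on the task identifier, the continuous condition
    features, and the tokens generated so far (the autoregressive context).
    """
    context = (
        task_id * 31
        + int(sum(condition) * 10)
        + sum(generated) * 5
        + len(generated) * 17
    )
    return [(context + 7 * k) % 11 for k in range(CODEBOOK_SIZE + 1)]


def generate(task, condition, max_len=6):
    """Greedy autoregressive decoding of discrete tokens for one task."""
    task_id = TASK_IDS[task]
    tokens = []
    for _ in range(max_len):
        scores = toy_scores(task_id, condition, tokens)
        # Pick the highest-scoring candidate (first one wins on ties).
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    features = [0.1, 0.2]  # placeholder continuous condition features
    print(generate("task_a", features))
    print(generate("task_b", features))
```

In a real system the scoring function would be a learned network and the emitted ids would be decoded to a waveform by the codec; the sketch only shows how a single loop can serve several tasks once the task identifier is part of the context.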
