DashengTokenizer: One layer is enough for unified audio understanding and generation
Heinrich Dinkel, Xingwei Sun, Gang Li, Jiahao Mei, Yadong Niu, Jizhong Liu, Xiyang Li, Yifan Liao, Jiahao Zhou, Junbo Zhang, Jian Luan

TL;DR
DashengTokenizer is a continuous audio tokenizer that starts from frozen semantic features and injects acoustic information into them, yielding a single representation that serves both audio understanding and audio generation tasks.
Contribution
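It introduces a unified audio tokenizer that inverts the conventional paradigm: instead of training an acoustic tokenizer and then integrating frozen semantic knowledge, it starts from frozen semantic features and injects acoustic information, improving performance without relying on a VAE-based architecture.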
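The idea of injecting acoustic information into frozen semantic features can be illustrated with a toy sketch. The summary does not say how the injection is implemented, so everything here is an assumption made for illustration: the additive form, the single linear projection, and the hypothetical helpers `frozen_semantic_encoder`, `acoustic_features`, and `inject`.

```python
def frozen_semantic_encoder(frame):
    """Hypothetical stand-in for a frozen semantic encoder (never updated here)."""
    return [sum(frame) / len(frame), max(frame) - min(frame)]


def acoustic_features(frame):
    """Hypothetical low-level descriptor: mean energy and mean absolute delta."""
    energy = sum(s * s for s in frame) / len(frame)
    delta = sum(abs(b - a) for a, b in zip(frame, frame[1:])) / (len(frame) - 1)
    return [energy, delta]


def inject(semantic, acoustic, weight):
    """Assumed injection: add one linear projection of the acoustic features
    to the frozen semantic features. Only `weight` would be trainable."""
    projected = [sum(w * a for w, a in zip(row, acoustic)) for row in weight]
    return [s + p for s, p in zip(semantic, projected)]


frame = [0.0, 1.0, 0.0, -1.0]
semantic = frozen_semantic_encoder(frame)
acoustic = acoustic_features(frame)

# With a zero projection the semantic features pass through unchanged.
zero_weight = [[0.0, 0.0], [0.0, 0.0]]
unchanged = inject(semantic, acoustic, zero_weight)

# With an identity projection the acoustic features are simply added.
identity_weight = [[1.0, 0.0], [0.0, 1.0]]
tokens = inject(semantic, acoustic, identity_weight)
```

In this sketch the semantic pathway is never modified, so a zero projection recovers the original semantic features exactly; only the projection matrix would be learned.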
Findings
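Outperforms previous audio codec and audio encoder baselines in linear evaluation across 22 tasks, while maintaining competitive audio reconstruction quality
Acoustic injection improves performance on speech emotion recognition, music understanding, and acoustic scene classification
Surpasses VAE-based methods on text-to-audio and text-to-music tasks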
Abstract
This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer's generative performance on text-to-audio (TTA), text-to-music (TTM), and speech enhancement (SE). Our approach surpasses…
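The linear evaluation mentioned in the abstract is commonly carried out as a linear probe: the encoder stays frozen and only a single linear classifier is trained on its features. The sketch below shows that general protocol on synthetic two-class data. It is not the paper's evaluation code; `frozen_encoder`, `make_toy_data`, and the hyperparameters are hypothetical choices for illustration.

```python
import math
import random


def frozen_encoder(x):
    """Hypothetical frozen encoder: a fixed mapping that is never trained."""
    return [x[0] + x[1], x[0] - x[1]]


def make_toy_data(n_per_class, seed):
    """Two Gaussian clusters standing in for a two-class audio task."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label in (0, 1):
        center = -2.0 if label == 0 else 2.0
        for _ in range(n_per_class):
            xs.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
            ys.append(label)
    return xs, ys


def train_linear_probe(features, labels, epochs=100, lr=0.1):
    """Fit logistic regression (one linear layer) by batch gradient descent."""
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, features, labels):
    """Fraction of examples whose thresholded logit matches the label."""
    hits = 0
    for f, y in zip(features, labels):
        pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
        hits += int(pred == y)
    return hits / len(labels)


train_x, train_y = make_toy_data(40, seed=0)
test_x, test_y = make_toy_data(40, seed=1)
w, b = train_linear_probe([frozen_encoder(x) for x in train_x], train_y)
probe_accuracy = accuracy(w, b, [frozen_encoder(x) for x in test_x], test_y)
```

Because the encoder is never updated, the probe's accuracy reflects how linearly separable the frozen features already are, which is the property a linear evaluation is meant to measure.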
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition
