DashengTokenizer: One layer is enough for unified audio understanding and generation

Heinrich Dinkel; Xingwei Sun; Gang Li; Jiahao Mei; Yadong Niu; Jizhong Liu; Xiyang Li; Yifan Liao; Jiahao Zhou; Junbo Zhang; Jian Luan

arXiv:2602.23765·cs.SD·March 27, 2026

Open Access · 2 Models

TL;DR

DashengTokenizer is a continuous audio tokenizer that keeps semantic features frozen and injects acoustic information into them, so that a single tokenizer serves both audio understanding and audio generation tasks.

Contribution

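The paper introduces a unified audio tokenizer that inverts the conventional paradigm: instead of training an acoustic tokenizer and then integrating frozen semantic knowledge, it starts from frozen semantic features and injects acoustic information into them. The resulting tokenizer does not rely on a VAE-based architecture.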

Findings

01 Outperforms previous audio codec and audio encoder baselines in linear evaluation across 22 tasks

02 Acoustic injection improves performance on speech emotion recognition, music understanding, and acoustic scene classification

03 Surpasses VAE-based methods on text-to-audio and text-to-music generation

Abstract

This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer's generative performance on text-to-audio (TTA), text-to-music (TTM), and speech enhancement (SE). Our approach surpasses…
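The "linear evaluation" mentioned in the abstract usually means training only a linear classifier on top of features from a frozen encoder. The sketch below illustrates that general protocol; it is not the authors' evaluation code. The synthetic embeddings stand in for real tokenizer outputs, and every name in it (`mean_pool`, `train_linear_probe`, `make_clips`) is hypothetical.

```python
import numpy as np


def mean_pool(frames):
    """Average frame-level embeddings (T, D) into one clip-level vector (D,)."""
    return frames.mean(axis=0)


def train_linear_probe(X, y, num_classes, lr=0.5, steps=300):
    """Fit a softmax classifier on fixed features; the encoder is never updated."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the highest-scoring class index for each row of X."""
    return (X @ W + b).argmax(axis=1)


def make_clips(rng, centers, clips_per_class, frames_per_clip):
    """Assumption: synthetic frame embeddings standing in for frozen tokenizer outputs."""
    labels = np.repeat(np.arange(len(centers)), clips_per_class)
    dim = centers.shape[1]
    clips = [centers[c] + rng.normal(size=(frames_per_clip, dim)) for c in labels]
    return np.stack([mean_pool(clip) for clip in clips]), labels


rng = np.random.default_rng(0)
num_classes, dim = 3, 16
centers = 3.0 * np.eye(num_classes, dim)  # one well-separated center per class

X_train, y_train = make_clips(rng, centers, clips_per_class=40, frames_per_clip=25)
X_test, y_test = make_clips(rng, centers, clips_per_class=20, frames_per_clip=25)

W, b = train_linear_probe(X_train, y_train, num_classes)
accuracy = float((predict(X_test, W, b) == y_test).mean())
print(f"linear-probe accuracy on synthetic held-out clips: {accuracy:.2f}")
```

Because only `W` and `b` are trained, the score reflects how linearly separable the frozen features already are, which is why this protocol is commonly used to compare encoders.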

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition