WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Shengpeng Ji; Ziyue Jiang; Wen Wang; Yifu Chen; Minghui Fang; Jialong Zuo; Qian Yang; Xize Cheng; Zehan Wang; Ruiqi Li; Ziang Zhang; Xiaoda Yang; Rongjie Huang; Yidi Jiang; Qian Chen; Siqi Zheng; Zhou Zhao

arXiv:2408.16532 · eess.AS · February 26, 2025

TL;DR

WavTokenizer is a novel acoustic codec tokenizer that achieves extreme compression and high-quality audio reconstruction, enabling efficient and semantically rich audio modeling across speech, music, and audio domains.

Contribution

The paper introduces a new VQ-based tokenizer with a broader VQ space, extended context, and advanced attention, significantly improving compression and quality over previous models.

Findings

01

Achieves state-of-the-art subjective and objective audio quality.

02

Represents one second of 24 kHz audio with a single quantizer and only 40 or 75 tokens.

03

Demonstrates strong performance across speech, music, and audio reconstruction tasks.

Abstract

Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one second of audio at a 24 kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space,…
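To make the compression figures in the abstract concrete, the short sketch below works out how many waveform samples each token covers at 24 kHz for the 40 and 75 tokens-per-second settings, and what bitrate that implies for a single quantizer. The codebook size is not stated in the text above, so the value used here (`ASSUMED_CODEBOOK_SIZE = 4096`) is purely an illustrative assumption, and the helper names are hypothetical rather than part of the WavTokenizer codebase.

```python
import math


def samples_per_token(sample_rate_hz: int, tokens_per_second: int) -> float:
    """Number of waveform samples summarized by one discrete token."""
    return sample_rate_hz / tokens_per_second


def bitrate_bps(tokens_per_second: int, codebook_size: int,
                num_quantizers: int = 1) -> float:
    """Bits per second when each token indexes one entry of a codebook."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * num_quantizers * bits_per_token


SAMPLE_RATE_HZ = 24_000       # from the abstract
ASSUMED_CODEBOOK_SIZE = 4096  # assumption for illustration, not from the text

for tps in (40, 75):          # token rates from the abstract
    hop = samples_per_token(SAMPLE_RATE_HZ, tps)
    rate = bitrate_bps(tps, ASSUMED_CODEBOOK_SIZE)
    print(f"{tps} tokens/s: {hop:.0f} samples per token, {rate:.0f} bit/s")
```

Under these assumptions each token stands for 600 samples (40 tokens/s) or 320 samples (75 tokens/s); the bitrate scales linearly with the token rate and with the base-2 logarithm of whatever codebook size is actually used.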

Code & Models

Repositories

jishengpeng/wavtokenizer
PyTorch · Official

Models

Videos

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Softmax · Attention Is All You Need