WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Shengpeng Ji, Ziyue Jiang, Wen Wang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Xize Cheng, Zehan Wang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Zhou Zhao

TL;DR
WavTokenizer is a novel acoustic codec tokenizer that achieves extreme compression and high-quality audio reconstruction, enabling efficient and semantically rich audio modeling across speech, music, and audio domains.
Contribution
WavTokenizer introduces a VQ-based tokenizer with a broader VQ space, extended contextual windows, and improved attention networks, improving both compression and reconstruction quality over previous acoustic codec models.
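To make "VQ-based tokenizer" concrete, here is a minimal sketch of single-codebook vector quantization: each encoder frame is replaced by the index of its nearest codeword, and decoding looks the codeword back up. This is a generic illustration, not the authors' implementation; the function names, the toy codebook, and the 2-dimensional frames are all hypothetical.

```python
# Generic single-codebook vector quantization (illustrative only;
# names and toy values are assumptions, not taken from WavTokenizer).

def quantize(frames, codebook):
    """Map each frame (a list of floats) to the index of its nearest codeword."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens, codebook):
    """Recover the (lossy) frames by looking each token up in the codebook."""
    return [codebook[t] for t in tokens]


# Toy example: three codewords, three 2-dimensional frames.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
toy_frames = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

tokens = quantize(toy_frames, toy_codebook)
print(tokens)  # one integer token per frame
print(dequantize(tokens, toy_codebook))
```

With a single quantizer, every frame yields exactly one token, so the token sequence can be fed to a language model without flattening several quantizer layers.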
Findings
Achieves state-of-the-art subjective and objective audio quality.
Requires only a single quantizer, with 40 or 75 tokens per second of 24kHz audio.
Demonstrates strong performance across speech, music, and audio reconstruction tasks.
Abstract
Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one-second audio of 24kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space,…
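The compression figures in the abstract can be checked with simple arithmetic. The sketch below computes how many waveform samples each token covers at 40 and 75 tokens per second of 24kHz audio, plus the resulting bitrate. The codebook size of 4096 is an assumption made here for illustration; it is not stated in the text above, and the helper name is hypothetical.

```python
import math


def token_budget(sample_rate_hz, tokens_per_second, codebook_size):
    """Return (samples covered by one token, bitrate in kbps) for a single quantizer."""
    samples_per_token = sample_rate_hz / tokens_per_second
    bits_per_token = math.log2(codebook_size)
    bitrate_kbps = tokens_per_second * bits_per_token / 1000
    return samples_per_token, bitrate_kbps


# ASSUMPTION: codebook size is not given in the abstract above.
ASSUMED_CODEBOOK_SIZE = 4096

for rate in (40, 75):
    samples, kbps = token_budget(24_000, rate, ASSUMED_CODEBOOK_SIZE)
    print(f"{rate} tokens/s: {samples:.0f} samples per token, {kbps:.2f} kbps")
```

Under that assumption, a 40-token stream spends one token on every 600 samples, and a 75-token stream one token on every 320 samples.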
Code & Models
- 🤗 novateur/WavTokenizer (model) · ♡ 55
- 🤗 novateur/WavTokenizer-medium-speech-75token (model) · ♡ 2
- 🤗 novateur/WavTokenizer-medium-music-audio-75token (model) · ♡ 6
- 🤗 novateur/WavTokenizer-large-unify-75token (model) · ♡ 4
- 🤗 novateur/WavTokenizer-large-unify-40token (model) · ♡ 6
- 🤗 novateur/WavTokenizer-large-speech-75token (model) · ♡ 14
- 🤗 ByteDance/MegaTTS3 (model) · 113 dl · ♡ 416
- 🤗 RedbeardNZ/MegaTTS3 (model) · 3 dl
- 🤗 bosonai/higgs-audio-v2-tokenizer (model) · 1.9k dl · ♡ 46
- 🤗 drbaph/MegaTTS3-WaveVAE (model) · 31 dl · ♡ 5
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Softmax · Attention Is All You Need
