dMel: Speech Tokenization made Simple
Richard He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, Navdeep Jaitly

TL;DR
This paper introduces dMel, a simple, robust, and training-free speech tokenization method that discretizes mel-filterbank channels into intensity bins, enabling effective speech synthesis and recognition within a unified transformer-based framework.
Contribution
The paper presents a novel discretization of mel-filterbank channels into intensity bins, simplifying speech tokenization and improving robustness and performance over existing methods.
Findings
dMel outperforms existing tokenization methods in preserving audio content
dMel demonstrates robustness to out-of-domain audio signals
RichTTS and RichASR achieve comparable or better results than specialized models
Abstract
Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated various compression-based speech tokenization methods to discretize continuous speech signals, enabling the application of language modeling techniques to discrete tokens. However, audio compressors introduce additional complexity and computational cost, and often fail on out-of-domain audio signals. In this work, we introduce a novel speech representation (dMel) that discretizes mel-filterbank channels into intensity bins, creating a simpler yet more effective representation compared to existing speech tokenization methods. Our approach demonstrates superior performance in preserving audio content and robustness to out-of-domain data, and offers a training-free, natural, and streamable representation.…
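The discretization described in the abstract can be illustrated with a minimal sketch. It assumes the log-mel values are already computed, that the bins are evenly spaced between a fixed minimum and maximum, and that each value maps to its nearest bin center; the function names, the toy values, and the choice of 4 bins are hypothetical and not taken from the paper.

```python
def make_bin_centers(lo, hi, num_bins):
    """Evenly spaced bin centers covering [lo, hi] (assumed bin layout)."""
    step = (hi - lo) / (num_bins - 1)
    return [lo + i * step for i in range(num_bins)]


def dmel_tokenize(log_mel, centers):
    """Map every (frame, channel) value to the index of its nearest bin center."""
    return [
        [min(range(len(centers)), key=lambda i: abs(value - centers[i]))
         for value in frame]
        for frame in log_mel
    ]


def dmel_detokenize(tokens, centers):
    """Recover an approximate log-mel value from each token."""
    return [[centers[token] for token in frame] for frame in tokens]


# Toy input: 2 frames x 3 mel channels (illustrative values only).
log_mel = [[-6.0, -2.5, 0.0], [-4.1, -0.9, -5.9]]
centers = make_bin_centers(lo=-6.0, hi=0.0, num_bins=4)
tokens = dmel_tokenize(log_mel, centers)
reconstruction = dmel_detokenize(tokens, centers)
```

No parameters are learned in this sketch, which matches the training-free claim: the only state is the bin layout, and the reconstruction error per value is bounded by half the bin spacing.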
Peer Reviews
Decision · Submitted to ICLR 2025
- The proposed method is efficient, as it avoids hierarchical dependencies among mel-spectrogram channels, allowing for independent modeling of each channel within each frame using a straightforward, decoder-only (LM-style) transformer architecture.
- The approach is robust, simple yet innovative, with comprehensive evaluations that support the design choices.
- The encoder operates independently of the decoder, unlike many other tokenizers, making it compatible with any vocoder that accepts mel
- The evaluation could be more thorough by incorporating existing benchmarks such as Codec-Superb and DASB, allowing for a more comprehensive comparison of the proposed method against existing models under standardized settings.
- The related works section could be expanded to include methods that use frequency domain inputs, such as those discussed in the following papers:
  - https://arxiv.org/pdf/2406.05298
  - https://arxiv.org/pdf/2201.09429
  - https://arxiv.org/pdf/2405.00233
- The idea of quantizing the mel spectrogram as tokenization is interesting and simple (in a good way).
- Results on TTS and ASR show that dMel quantization has a small impact on models trained on continuous representations; training downstream models on top of dMel also provided results similar to their continuous counterparts. These observations are interesting, showcasing the generalizability of dMel.
- Overall, I believe dMel is much more efficient in terms of model size and inference speed comparing
- As a speech tokenization paper, this work lacks a discussion of the overall bit rate for compression besides the frame rate, especially in the comparison with prior works (e.g., Table 3). dMel is over 12.8kbps~5kbps (assuming 40 fps $\times$ 32 mel filters $\times$ 4 bit-per-filter)~, which is higher than HuBERT-KM and SpeechTokenizer.
- This paper spent most of the space discussing ASR & TTS systems based on dMel. While the numbers are good, it is still not as good as a normal mel spectro
The proposed dMel mitigates the issues in existing speech tokenizers. First, prior works like self-supervised learning (SSL) based tokenizers require extensive pre-training and sometimes cannot preserve the acoustic details needed for speech generation and synthesis. Second, neural codecs preserve fine-grained acoustic representations but might not be able to perform ASR and TTS because of the weak correlations between codebooks and frames. The authors propose a parameter- and training-free appr
Despite the success of the dMel method presented in the experiment results, the following issues question its novelty and effectiveness. 1) **Bitrate:** Bitrate is a crucial metric for comparing different tokenizers in prior studies but is not included in this paper. According to the provided information, dMel@40Hz, HuBERT-KM, and SpeechTokenizer, respectively, have bitrates of 12.8, 0.4, and 4kbps. The huge difference in bitrates might lead to an **unfair comparison**. Moreover, the number o
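The bitrate figures raised in the reviews follow from frame rate × tokens per frame × bits per token. The sketch below works through that arithmetic. The 40 fps × 32 filters × 4 bits configuration is the one a reviewer states; the 80-channel configuration is an assumption chosen here because it reproduces the 12.8 kbps figure quoted for dMel@40Hz, and the helper name is hypothetical.

```python
def bitrate_kbps(frame_rate_hz, tokens_per_frame, bits_per_token):
    """Bitrate in kbps = frames/s * tokens/frame * bits/token / 1000."""
    return frame_rate_hz * tokens_per_frame * bits_per_token / 1000.0


# Configuration stated in one review: 40 fps, 32 mel filters, 4 bits per filter.
reviewer_estimate = bitrate_kbps(40, 32, 4)

# Assumed configuration (80 channels, 4 bits each) matching the quoted 12.8 kbps.
assumed_dmel = bitrate_kbps(40, 80, 4)

print(f"32 filters: {reviewer_estimate:.2f} kbps")
print(f"80 filters: {assumed_dmel:.2f} kbps")
```

Under either configuration the result is well above the 0.4 kbps quoted for HuBERT-KM, which is the gap the reviewers flag as a possible source of unfair comparison.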
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis
