dMel: Speech Tokenization made Simple

Richard He Bai; Tatiana Likhomanenko; Ruixiang Zhang; Zijin Gu; Zakaria Aldeneh; Navdeep Jaitly

arXiv:2407.15835·cs.CL·May 22, 2025·2 cites

TL;DR

This paper introduces dMel, a simple, robust, and training-free speech tokenization method that discretizes mel-filterbank channels into intensity bins, enabling speech synthesis and recognition within a unified transformer-based framework.

Contribution

The paper presents a novel discretization of mel-filterbank channels into intensity bins, simplifying speech tokenization and improving robustness and performance over existing methods.

Findings

01 · dMel outperforms existing tokenization methods in preserving audio content

02 · dMel demonstrates robustness to out-of-domain audio signals

03 · RichTTS and RichASR achieve comparable or better results than specialized models

Abstract

Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated various compression-based speech tokenization methods to discretize continuous speech signals, enabling the application of language modeling techniques to discrete tokens. However, audio compressors introduce additional complexity and computational cost, and often fail on out-of-domain audio signals. In this work, we introduce a novel speech representation (dMel) that discretizes mel-filterbank channels into intensity bins, creating a simpler yet more effective representation compared to existing speech tokenization methods. Our approach demonstrates superior performance in preserving audio content and robustness to out-of-domain data, and offers a training-free, natural, and streamable representation.…
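The discretization step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes uniformly spaced bins over a fixed log-mel range, and the function names `dmel_encode` and `dmel_decode`, the range `[-10, 2]`, and the choice of 16 bins and 80 channels are all hypothetical values picked for the example.

```python
import numpy as np

def dmel_encode(log_mel, lo, hi, num_bins=16):
    """Map each log-mel value to the index of a uniform intensity bin.

    log_mel: array of shape (frames, channels). lo/hi are the assumed
    minimum and maximum of the value range (hypothetical parameters).
    """
    # num_bins - 1 interior boundaries split [lo, hi] into num_bins bins.
    edges = np.linspace(lo, hi, num_bins + 1)[1:-1]
    # Clip so out-of-range values fall into the first or last bin.
    return np.digitize(np.clip(log_mel, lo, hi), edges)

def dmel_decode(tokens, lo, hi, num_bins=16):
    """Reconstruct approximate log-mel values from bin centers."""
    width = (hi - lo) / num_bins
    return lo + (tokens + 0.5) * width

# Random stand-in for a real log-mel spectrogram: 5 frames, 80 channels.
rng = np.random.default_rng(0)
log_mel = rng.uniform(-10.0, 2.0, size=(5, 80))

tokens = dmel_encode(log_mel, lo=-10.0, hi=2.0)
recon = dmel_decode(tokens, lo=-10.0, hi=2.0)
max_err = np.abs(recon - log_mel).max()
```

Because every channel in every frame is quantized independently, no codebook has to be trained, and the reconstruction error per value is bounded by half a bin width under these assumptions.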

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The proposed method is efficient, as it avoids hierarchical dependencies among mel-spectrogram channels, allowing for independent modeling of each channel within each frame using a straightforward, decoder-only (LM-style) transformer architecture.
- The approach is robust, simple yet innovative, with comprehensive evaluations that support the design choices.
- The encoder operates independently of the decoder, unlike many other tokenizers, making it compatible with any vocoder that accepts mel

Weaknesses

- The evaluation could be more thorough by incorporating existing benchmarks such as Codec-Superb and DASB, allowing for a more comprehensive comparison of the proposed method against existing models under standardized settings.
- The related works section could be expanded to include methods that use frequency domain inputs, such as those discussed in the following papers:
  - https://arxiv.org/pdf/2406.05298
  - https://arxiv.org/pdf/2201.09429
  - https://arxiv.org/pdf/2405.00233

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The idea of quantizing the mel spectrogram as tokenization is interesting and simple (in a good way).
- Results on TTS and ASR show that dMel quantization has a small impact on models trained on continuous representations; training downstream models on top of dMel also gave results similar to their continuous counterparts. These observations are interesting, showcasing the generalizability of dMel.
- Overall, I believe dMel is much more efficient in terms of model size and inference speed comparing

Weaknesses

- As a speech tokenization paper, this work lacks a discussion of the overall bit rate for compression beyond frame rate, especially in the comparison with prior works (e.g., Table 3). dMel is over 12.8kbps ~~5kbps (assuming 40 fps $\times$ 32 mel filters $\times$ 4 bit-per-filter)~~, which is higher than HuBERT-KM and SpeechTokenizer.
- This paper spent most of the space discussing ASR & TTS systems based on dMel. While the numbers are good, it is still not as good as a normal mel spectro

Reviewer 03 · Rating 3 · Confidence 5

Strengths

The proposed dMel mitigates the issues in existing speech tokenizers. First, prior works like self-supervised learning (SSL) based tokenizers require extensive pre-training and are sometimes unable to preserve acoustic details for speech generation and synthesis. Second, neural codecs preserve fine-grained acoustic representations but might not be able to perform ASR and TTS because of the weak correlations between codebooks and frames. The authors propose a parameter- and training-free appr

Weaknesses

Despite the success of the dMel method in the experimental results, the following issues call its novelty and effectiveness into question. 1) **Bitrate:** Bitrate is a crucial metric for comparing different tokenizers in prior studies but is not included in this paper. According to the provided information, dMel@40Hz, HuBERT-KM, and SpeechTokenizer have bitrates of 12.8, 0.4, and 4kbps, respectively. The huge difference in bitrates might lead to an **unfair comparison**. Moreover, the number o
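The bitrate figures the reviewers discuss follow from frame rate × tokens per frame × bits per token. The sketch below reproduces Reviewer 02's 5kbps estimate (40 fps, 32 mel filters, 4 bits per filter) and shows one configuration that yields the 12.8kbps quoted for dMel@40Hz. The 80-channel count is an assumption chosen to match that figure and is not stated in the review text; `bitrate_kbps` is a hypothetical helper name.

```python
def bitrate_kbps(frame_rate_hz, tokens_per_frame, bits_per_token):
    """Bitrate in kbps for a tokenizer emitting fixed-size frames."""
    return frame_rate_hz * tokens_per_frame * bits_per_token / 1000

# Reviewer 02's estimate: 40 fps x 32 mel filters x 4 bits per filter.
estimate_32 = bitrate_kbps(40, 32, 4)
print(estimate_32)  # 5.12

# Assumption: 80 mel channels at 4 bits each (16 bins) and 40 fps,
# which reproduces the 12.8 kbps quoted for dMel@40Hz.
estimate_80 = bitrate_kbps(40, 80, 4)
print(estimate_80)  # 12.8
```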

Code & Models

Repositories

apple/dmel
pytorch · Official

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis