DM-Codec: Distilling Multimodal Representations for Speech Tokenization
Md Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, A K M Mahbubur Rahman, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Md Mofijul Islam, Amin Ahsan Ali

TL;DR
DM-Codec introduces a novel speech tokenizer that distills acoustic, semantic, and contextual speech representations, significantly improving speech recognition accuracy and quality over existing models.
Contribution
The paper proposes two new distillation methods incorporating contextual information into speech tokenization, leading to the development of the DM-Codec model that outperforms state-of-the-art approaches.
Findings
Reduces Word Error Rate by up to 13.46%
Improves speech intelligibility by 1.85%
Enhances speech quality by 5.84% on LibriSpeech
Abstract
Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To…
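The two transcription metrics named in the abstract have standard definitions: WER is the number of word-level substitutions, deletions, and insertions divided by the reference length, and WIL is one minus the product of the hit rates measured against the reference and against the hypothesis. The sketch below is a minimal, dependency-free illustration of those definitions. The function names are hypothetical and this is not the paper's evaluation code; ties between equally cheap alignments are broken in favour of more hits, which is an assumption of this sketch.

```python
def word_alignment_counts(reference, hypothesis):
    """Return (edits, hits, n_ref, n_hyp) for a minimum-edit word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (edit cost, hits) for ref[:i] aligned against hyp[:j].
    prev = [(j, 0) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0)]
        for j in range(1, len(hyp) + 1):
            match = ref[i - 1] == hyp[j - 1]
            candidates = [
                # Diagonal step: a hit if the words match, else a substitution.
                (prev[j - 1][0] + (0 if match else 1),
                 prev[j - 1][1] + (1 if match else 0)),
                (prev[j][0] + 1, prev[j][1]),        # deletion
                (cur[j - 1][0] + 1, cur[j - 1][1]),  # insertion
            ]
            # Lowest cost first; among equal costs prefer more hits.
            cur.append(min(candidates, key=lambda c: (c[0], -c[1])))
        prev = cur
    edits, hits = prev[-1]
    return edits, hits, len(ref), len(hyp)


def wer(reference, hypothesis):
    """Word Error Rate: (S + D + I) / number of reference words."""
    edits, _, n_ref, _ = word_alignment_counts(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return edits / n_ref


def wil(reference, hypothesis):
    """Word Information Lost: 1 - (H / N_ref) * (H / N_hyp)."""
    _, hits, n_ref, n_hyp = word_alignment_counts(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    if n_hyp == 0:
        return 1.0  # nothing was transcribed, so all information is lost
    return 1.0 - (hits / n_ref) * (hits / n_hyp)
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", the alignment has one substitution, one deletion, and four hits, so WER is 2/6 and WIL is 1 - (4/6)(4/5).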
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The only strength of this paper is the idea of leveraging a text-based language model to improve audio codec, which is reasonable and interesting. That being said, the idea is not properly executed in this work (see weaknesses).
- **Novelty**: DM-Codec is essentially SpeechTokenizer with additional LM distillation, which is somewhat incremental.
- **Technical correctness**: While the extra text-based target embedding is the only novel component, it's unclear how the text representation and the acoustic representation are aligned. Text embeddings and acoustic embeddings clearly do **not** share the same length, but this work does not provide any detail on this. Instead, it uses a misleading notation $n$ to indicate the
The proposed DM-Codec method improves on existing speech tokenization systems that incorporate only acoustic and semantic information (e.g., SpeechTokenizer) by adding textual representations via an LM-guided distillation method. The contextual information is learned with a continuous representation distillation technique and is then combined with speech self-supervised learning model (SM) guided distillation. The proposed method is compared with several existing methods, including EnCod
• The test set is small. Only 300 audio samples are randomly selected from the LibriSpeech test subset as the evaluation set. Although the authors explain that this keeps the setup consistent with the baseline, 300 utterances are not enough to draw a statistically meaningful conclusion. The results in Table 3 may be evidence of this: no obvious relation between WER and the SM/LM loss weights can be read from these results. E.g., the authors said LM information is more helpful for lower WER bu
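One way to make the sample-size concern concrete is a percentile bootstrap over utterances, which gives a confidence interval for the WER difference between two systems on the same test set. The sketch below is illustrative only: the function name, the inputs (per-utterance error counts and reference lengths), and the defaults are assumptions, and it is not the significance test used in the paper.

```python
import random


def bootstrap_wer_diff_ci(errors_a, errors_b, ref_lengths,
                          n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for corpus WER(A) - WER(B) on paired utterances.

    errors_a / errors_b: per-utterance word error counts for systems A and B.
    ref_lengths: per-utterance reference word counts (all positive).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(ref_lengths)
    diffs = []
    for _ in range(n_resamples):
        # Resample utterance indices with replacement, keeping A/B paired.
        idx = [rng.randrange(n) for _ in range(n)]
        total_words = sum(ref_lengths[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / total_words
        wer_b = sum(errors_b[i] for i in idx) / total_words
        diffs.append(wer_a - wer_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval contains zero, the observed WER gap on that test set cannot be distinguished from resampling noise at the chosen level, which is the kind of check that becomes more decisive as the number of utterances grows.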
This work first incorporates contextual representations via an LM-guided distillation method, and it enhances the retention of acoustic and speech information in reconstructed speech.
I do not fully agree that contextual representation should hold equal importance to acoustic and semantic representations. The improved intelligibility of DM-Codec is mainly due to an additional teacher LM serving a distillation role within the RVQ. However, this LM relies on an ASR system (M_STT in Fig. 2) for transcription. Given this setup, it is difficult to ascertain whether the improvements are driven by the BERT LM or by the M_STT ASR system. More importantly, the reason audio codecs are b
The authors conducted significance tests, an important analysis that demonstrated the performance gap between methods but was not included in most prior literature. Moreover, the ablation studies are comprehensive and cover most aspects of the method design.
1) **LM Guided Distillation (learning target):** According to the text, the authors did not explicitly mention how they aligned the codec encoder features with the LM hidden representations. It is a commonly known fact that transcribed speech has a significantly shorter utterance length, especially when the transcription is tokenized as words or subword units. Hence, the LM representations of the transcriptions must be shorter than the codec encoder output, leading to a length mismatch when co
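The length mismatch raised in the reviews can be illustrated with toy shapes: a few seconds of speech yields far more codec frames than subword tokens, so a frame-by-frame comparison is undefined without some alignment step. The sketch below shows one generic workaround, mean-pooling both sequences over their length axis before taking a cosine similarity. This is an assumption made purely for illustration, not the alignment used by DM-Codec, and the frame and token counts are hypothetical.

```python
import math


def mean_pool(seq):
    """Average a sequence of equal-width vectors over its length axis."""
    dim = len(seq[0])
    return [sum(vec[d] for vec in seq) / len(seq) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of the same width."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pooled_cosine(frames, tokens):
    """Compare sequences of different lengths after pooling each to one vector."""
    return cosine(mean_pool(frames), mean_pool(tokens))


# Hypothetical shapes: 150 codec frames versus 8 subword-token embeddings,
# both already projected to the same feature width (2 here, for brevity).
frames = [[1.0, 0.0] for _ in range(150)]
tokens = [[2.0, 0.0] for _ in range(8)]
similarity = pooled_cosine(frames, tokens)
```

Pooling discards temporal order, so it sidesteps the mismatch rather than resolving it; alternatives such as interpolation or attention-based alignment make different trade-offs.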
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
