Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation
Liyong Guo, Xiaoyu Yang, Quandong Wang, Yuxiang Kong, Zengwei Yao, Fan Cui, Fangjun Kuang, Wei Kang, Long Lin, Mingshuang Luo, Piotr Zelasko, Daniel Povey

TL;DR
This paper introduces MVQ-KD, a novel knowledge distillation framework that compresses teacher embeddings into codebook indexes, significantly reducing storage needs while maintaining performance in speech recognition tasks.
Contribution
The paper proposes a new Multi-codebook Vector Quantization approach for knowledge distillation that reduces storage requirements and speeds up training in speech recognition models.
Findings
Achieves comparable performance to traditional KD methods with 256x less storage.
Yields relative word error rate reductions (WERR) of 13.8% and 8.2% on LibriSpeech test sets.
Speeds up training relative to on-the-fly teacher label generation, because the teacher model does not have to be evaluated every batch.
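The storage figure can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the embedding dimension, float width, number of codebooks, and codebook size are assumptions chosen so that the arithmetic reproduces a 256x ratio; none of them are stated in this summary.

```python
# Hypothetical sizes for illustration; they are NOT given in the summary above.
EMBED_DIM = 1024        # assumed teacher embedding dimension
BYTES_PER_FLOAT = 4     # assumed float32 storage
NUM_CODEBOOKS = 16      # assumed number of codebooks
CODEBOOK_SIZE = 256     # assumed entries per codebook (8 bits per index)


def bytes_per_frame_float(dim, bytes_per_value):
    """Bytes needed to store one raw embedding frame."""
    return dim * bytes_per_value


def bytes_per_frame_indexes(num_codebooks, codebook_size):
    """Bytes needed to store one frame as codebook indexes."""
    bits_per_index = (codebook_size - 1).bit_length()
    return num_codebooks * bits_per_index / 8


raw = bytes_per_frame_float(EMBED_DIM, BYTES_PER_FLOAT)
compressed = bytes_per_frame_indexes(NUM_CODEBOOKS, CODEBOOK_SIZE)
ratio = raw / compressed
print(f"raw={raw} B, indexes={compressed} B, ratio={ratio}x")
```

Under these assumed sizes, each frame shrinks from 4096 bytes to 16 bytes; other configurations would give a different ratio.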
Abstract
Knowledge distillation (KD) is a common approach to improve model performance in automatic speech recognition (ASR), where a student model is trained to imitate the output behaviour of a teacher model. However, traditional KD methods suffer from a teacher label storage issue, especially when the training corpora are large. Although on-the-fly teacher label generation tackles this issue, training is significantly slower because the teacher model has to be evaluated every batch. In this paper, we reformulate the generation of teacher labels as a codec problem. We propose a novel Multi-codebook Vector Quantization (MVQ) approach that compresses teacher embeddings to codebook indexes (CI). Based on this, a KD training framework (MVQ-KD) is proposed where a student model predicts the CI generated from the embeddings of a self-supervised pre-trained teacher model. Experiments on the…
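The idea of predicting codebook indexes can be sketched in a few lines. This is a minimal illustration, not the paper's method: it uses plain residual quantization with random codebooks as a stand-in for the trained MVQ quantizer, and the names `encode`, `decode`, and `codebook_loss` are hypothetical. It shows a teacher embedding being turned into one index per codebook, and a student being scored by summing a cross-entropy term over the codebooks.

```python
import math
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def encode(codebooks, embedding):
    """Residual quantization (a stand-in for MVQ): one index per codebook."""
    residual = list(embedding)
    indexes = []
    for cb in codebooks:
        k = nearest(cb, residual)
        indexes.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return indexes


def decode(codebooks, indexes):
    """Approximate the embedding as the sum of the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for cb, k in zip(codebooks, indexes):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def codebook_loss(student_logits, indexes):
    """Sum of per-codebook cross-entropies against the teacher's indexes."""
    return sum(cross_entropy(l, t) for l, t in zip(student_logits, indexes))


# Toy sizes and random codebooks, assumed purely for illustration.
random.seed(0)
DIM, NUM_CODEBOOKS, CODEBOOK_SIZE = 8, 4, 16
codebooks = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for _ in range(NUM_CODEBOOKS)
]
teacher_embedding = [random.gauss(0, 1) for _ in range(DIM)]

targets = encode(codebooks, teacher_embedding)   # what would be stored on disk
reconstruction = decode(codebooks, targets)      # lossy view of the embedding

# An untrained student with uniform logits over each codebook.
uniform_logits = [[0.0] * CODEBOOK_SIZE for _ in range(NUM_CODEBOOKS)]
loss = codebook_loss(uniform_logits, targets)
print(targets, round(loss, 4))
```

In this framing only `targets` needs to be stored per frame, and the student is trained with a classification loss rather than a regression loss on the raw embedding.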
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Network On Network · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
