Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation

Liyong Guo; Xiaoyu Yang; Quandong Wang; Yuxiang Kong; Zengwei Yao; Fan Cui; Fangjun Kuang; Wei Kang; Long Lin; Mingshuang Luo; Piotr Zelasko; Daniel Povey

arXiv:2211.00508·eess.AS·November 2, 2022

TL;DR

This paper introduces MVQ-KD, a novel knowledge distillation framework that compresses teacher embeddings into codebook indexes, significantly reducing storage needs while maintaining performance in speech recognition tasks.

Contribution

The paper proposes a new Multi-codebook Vector Quantization approach for knowledge distillation that reduces storage requirements and speeds up training in speech recognition models.

Findings

01. Achieves performance comparable to traditional KD methods while requiring 256x less storage.

02. Yields relative word error rate reductions (WERR) of 13.8% and 8.2% on the LibriSpeech test sets.

03. Speeds up training relative to on-the-fly teacher label generation, because the teacher model does not have to be evaluated on every batch.
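
As a rough illustration of where a storage reduction of this order can come from, the sketch below compares the per-frame size of a raw floating-point embedding with the size of its codebook indexes. Every parameter value here (a 1024-dimensional float32 embedding, 16 codebooks of 256 entries each) is an assumption chosen for illustration and is not taken from the paper; the function names are hypothetical.

```python
import math


def raw_bytes_per_frame(embed_dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one frame of a dense embedding."""
    return embed_dim * bytes_per_value


def ci_bytes_per_frame(num_codebooks: int, codebook_size: int) -> float:
    """Bytes needed to store one frame as codebook indexes.

    Each index needs ceil(log2(codebook_size)) bits.
    """
    bits_per_index = math.ceil(math.log2(codebook_size))
    return num_codebooks * bits_per_index / 8


# Assumed, illustrative values (not the paper's configuration).
raw = raw_bytes_per_frame(embed_dim=1024, bytes_per_value=4)
ci = ci_bytes_per_frame(num_codebooks=16, codebook_size=256)
ratio = raw / ci
print(f"raw={raw} B, indexes={ci} B, ratio={ratio}x")
```

Under these assumed values each frame shrinks from 4096 bytes to 16 bytes. Other embedding sizes or codebook counts would give a different ratio.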

Abstract

Knowledge distillation (KD) is a common approach to improving model performance in automatic speech recognition (ASR), where a student model is trained to imitate the output behaviour of a teacher model. However, traditional KD methods suffer from a teacher label storage problem, especially when the training corpora are large. Although on-the-fly teacher label generation avoids this problem, training is significantly slower because the teacher model has to be evaluated on every batch. In this paper, we reformulate the generation of teacher labels as a codec problem. We propose a novel Multi-codebook Vector Quantization (MVQ) approach that compresses teacher embeddings to codebook indexes (CI). Based on this, a KD training framework (MVQ-KD) is proposed in which a student model predicts the CI generated from the embeddings of a self-supervised pre-trained teacher model. Experiments on the…
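
The idea of a student predicting codebook indexes can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the quantizer below is a simple residual nearest-neighbour search over random codebooks (a stand-in for the paper's MVQ, whose details are not given in this summary), and the loss is assumed to be a sum of per-codebook cross-entropies. All names (`quantize`, `codebook_index_loss`) and sizes are hypothetical.

```python
import numpy as np


def quantize(embeddings: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Stand-in quantizer: residual nearest-neighbour search.

    embeddings: (T, D) teacher frames; codebooks: (N, K, D) with K <= 256.
    Returns (T, N) uint8 codebook indexes.
    """
    residual = embeddings.copy()
    indexes = np.empty((embeddings.shape[0], codebooks.shape[0]), dtype=np.uint8)
    for n, cb in enumerate(codebooks):
        # Squared distance from each residual frame to each codebook entry.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indexes[:, n] = idx
        residual -= cb[idx]  # the next codebook encodes what is left
    return indexes


def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def codebook_index_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Assumed KD loss: cross-entropy summed over codebooks, averaged over frames.

    logits: (T, N, K) student predictions; targets: (T, N) teacher indexes.
    """
    log_probs = log_softmax(logits, axis=-1)
    tgt = targets.astype(np.int64)[..., None]
    picked = np.take_along_axis(log_probs, tgt, axis=-1).squeeze(-1)  # (T, N)
    return float(-picked.sum(axis=1).mean())


rng = np.random.default_rng(0)
T, D, N, K = 5, 16, 4, 8  # toy sizes, chosen only for the example
teacher_embeddings = rng.normal(size=(T, D))
codebooks = rng.normal(size=(N, K, D))

ci = quantize(teacher_embeddings, codebooks)  # stored once, offline
student_logits = np.zeros((T, N, K))          # an untrained, uniform student
loss = codebook_index_loss(student_logits, ci)
print(ci.shape, ci.dtype, loss)
```

With uniform logits every index has probability 1/K, so the loss equals N·log(K) regardless of the targets. In this sketch only the small integer array `ci` needs to be kept for training, not the teacher embeddings themselves.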

Code & Models

Repositories

k2-fsa/icefall (PyTorch, official)

Models

marcoyang/pruned_transducer_stateless6_hubert_xtralarge_ll60k_finetune_ls960

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Network On Network · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings