Collaborative Training of Acoustic Encoders for Speech Recognition

Varun Nagaraja; Yangyang Shi; Ganesh Venkatesh; Ozlem Kalinli; Michael L. Seltzer; Vikas Chandra

arXiv:2106.08960·cs.CL·July 15, 2021

TL;DR

This paper introduces a collaborative training method for acoustic encoders of varying sizes in speech recognition, leveraging shared knowledge and co-distillation to improve performance across models.

Contribution

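The paper proposes jointly training acoustic encoders of different sizes in a sequence transducer setup. The encoders share a common predictor and joiner, and are additionally trained with co-distillation through an auxiliary frame-level chenone prediction task. Joint training also reduces redundant data handling during training.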

Findings

01  Up to 11% relative WER improvement on LibriSpeech

02  Shared encoder training benefits multiple model sizes

03  Co-distillation enhances frame-level prediction accuracy
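
For readers unfamiliar with the term, a relative WER improvement is the drop in word error rate expressed as a percentage of the baseline WER. The sketch below shows that arithmetic only. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Percentage reduction in WER, measured relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values chosen for illustration, not results from the paper.
print(relative_wer_improvement(10.0, 9.0))
```

A one-point absolute drop therefore counts for more, in relative terms, when the baseline WER is already low.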

Abstract

On-device speech recognition requires training models of different sizes for deployment on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the knowledge shared between them. Joint training is also efficient since it reduces the redundancy in the training procedure's data handling operations. We propose a method for collaboratively training acoustic encoders of different sizes for speech recognition. We use a sequence transducer setup where different acoustic encoders share common predictor and joiner modules. The acoustic encoders are also trained using co-distillation through an auxiliary task for frame-level chenone prediction, along with the transducer loss. We perform experiments using the LibriSpeech corpus and demonstrate that the collaboratively trained acoustic encoders can provide…
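
The way the loss terms described in the abstract might combine can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that co-distillation is a per-frame KL divergence from a peer encoder's chenone posteriors to the student's, that the peer is treated as a fixed target, that all encoders emit the same number of frames, and that each encoder's transducer loss has already been computed elsewhere. The function names and the `distill_weight` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Turn one frame's logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def frame_distillation_loss(student_logits, peer_logits):
    """Mean per-frame KL(peer || student); the peer acts as a fixed target."""
    if len(student_logits) != len(peer_logits):
        raise ValueError("both encoders must emit the same number of frames")
    per_frame = [
        kl_divergence(softmax(peer), softmax(student))
        for student, peer in zip(student_logits, peer_logits)
    ]
    return sum(per_frame) / len(per_frame)


def collaborative_losses(transducer_losses, frame_logits, distill_weight=0.5):
    """Total loss per encoder: transducer loss + weighted distillation from peers.

    transducer_losses: dict of encoder name -> precomputed transducer loss (assumed given)
    frame_logits: dict of encoder name -> list of frames, each a list of class logits
    distill_weight: hypothetical weight on the auxiliary distillation term
    """
    if len(frame_logits) < 2:
        raise ValueError("co-distillation needs at least two encoders")
    totals = {}
    for name, logits in frame_logits.items():
        peers = [other for other in frame_logits if other != name]
        distill = sum(
            frame_distillation_loss(logits, frame_logits[peer]) for peer in peers
        ) / len(peers)
        totals[name] = transducer_losses[name] + distill_weight * distill
    return totals


# Hypothetical logits for two encoders over two frames and three classes.
logits = {
    "small": [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]],
    "large": [[3.0, 0.2, 0.1], [0.1, 2.5, 0.2]],
}
print(collaborative_losses({"small": 1.8, "large": 1.2}, logits))
```

In this sketch every encoder is both a student and a teacher, which is what separates co-distillation from one-way distillation out of a fixed large model. The shared predictor and joiner are not modelled here; they would only affect how each transducer loss is computed.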
