TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

Yuan Shangguan; Haichuan Yang; Danni Li; Chunyang Wu; Yassir Fathullah; Dilin Wang; Ayushi Dalmia; Raghuraman Krishnamoorthi; Ozlem Kalinli; Junteng Jia; Jay Mahadeokar; Xin Lei; Mike Seltzer; Vikas Chandra

arXiv:2309.01947·cs.CL·November 28, 2023

TL;DR

TODM introduces an efficient Supernet-based training method for RNN-T models, enabling the deployment of multiple hardware-optimized on-device ASR models with minimal additional training cost.

Contribution

The paper presents a novel Supernet training approach that uses techniques such as adaptive dropouts and knowledge distillation, achieving performance comparable to or better than that of individually tuned models.

Findings

1. The Supernet matches or surpasses manually tuned models, with up to 3% better relative word error rate (WER).

2. Training many models costs roughly the same as a single training job.

3. The approach is validated on LibriSpeech with promising results.

Abstract

Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture. Re-training and re-validating models after making these changes can be a resource-intensive task. This paper presents TODM (Train Once Deploy Many), a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with comparable GPU-hours to that of a single training job. TODM leverages insights from prior work on Supernet, where Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces layer sizes and widths of the Supernet to obtain subnetworks, making them smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet:…
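
The weight-sharing idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `SharedLinear` class, the `run_subnet` helper, the layer sizes, and the choice of slicing the leading rows and columns of each weight matrix are all assumptions made for illustration, written in plain Python rather than a deep learning framework. It shows only how a narrower subnetwork can reuse the parameters of a larger one instead of storing its own.

```python
import random


class SharedLinear:
    """Hypothetical linear layer whose sub-layers reuse one shared weight matrix."""

    def __init__(self, max_in, max_out, seed=0):
        rng = random.Random(seed)
        self.max_in, self.max_out = max_in, max_out
        # One full-size weight matrix; every subnetwork reads from it.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(max_in)]
                       for _ in range(max_out)]

    def forward(self, x, out_width):
        """Apply the first `out_width` rows to an input of length <= max_in."""
        if not (0 < out_width <= self.max_out and 0 < len(x) <= self.max_in):
            raise ValueError("requested sub-layer exceeds the Supernet layer")
        # zip() stops at len(x), so only the leading columns of each row are used.
        return [sum(w * v for w, v in zip(row, x))
                for row in self.weight[:out_width]]


def run_subnet(layers, x, widths):
    """Run every layer at a (possibly reduced) output width, with ReLU in between."""
    for layer, width in zip(layers, widths):
        x = [max(0.0, v) for v in layer.forward(x, width)]
    return x


# Illustrative sizes only: four layers, each at most 8 units wide.
supernet = [SharedLinear(8, 8, seed=i) for i in range(4)]
features = [0.5] * 8

full = run_subnet(supernet, features, [8, 8, 8, 8])   # largest model
small = run_subnet(supernet, features, [6, 4, 6, 4])  # narrower subnetwork
```

Because both calls read from the same `weight` matrices, any update made while training one width is visible to every other width, which is the property that lets a single training job serve many deployment sizes.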

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques