An Effective Training Framework for Light-Weight Automatic Speech Recognition Models
Abdul Hannan, Alessio Brutti, Shah Nawaz, Mubashir Noman

TL;DR
This paper proposes a two-step representation learning framework that derives several small, high-performing speech recognition models from a single large model, with a reported three-fold training speed-up and up to 12.54% WER reduction on ASR benchmarks.
Contribution
The paper introduces a two-step learning approach that creates multiple small ASR models from one large model in a limited number of training epochs, and reports better performance than existing methods.
Findings
Achieves up to 12.54% WER reduction
Provides three-fold training speed-up
Produces multiple small models from a single large model
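For readers unfamiliar with the metric behind the first finding, word error rate (WER) is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration and not the paper's evaluation code; the function names `word_error_rate` and `relative_reduction` are hypothetical, and the summary does not say whether the 12.54% figure is an absolute or a relative reduction, so the second helper only shows how a relative reduction would be computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (assumed definition)."""
    return (baseline_wer - new_wer) / baseline_wer
```

As a usage example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deletion against six reference words.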
Abstract
Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation, or require prolonged training of the smaller models to reach better performance. To address these issues, we introduce an effective two-step, representation-learning-based approach capable of producing several small models from a single large model while ensuring considerably better performance in a limited number of epochs. Comprehensive experimentation on ASR benchmarks reveals the efficacy of our approach, achieving three-fold training speed-up…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
