An Effective Training Framework for Light-Weight Automatic Speech Recognition Models

Abdul Hannan; Alessio Brutti; Shah Nawaz; Mubashir Noman

arXiv:2505.16991·cs.CV·May 29, 2025

TL;DR

This paper proposes a two-step representation learning framework that produces several small, high-performing speech recognition models from a single large model, with a three-fold training speed-up and up to 12.54% WER reduction on ASR benchmarks.

Contribution

It introduces a novel two-step learning approach that creates multiple small ASR models from one large model in a limited number of training epochs, outperforming existing methods.

Findings

01. Achieves up to 12.54% WER reduction

02. Provides three-fold training speed-up

03. Produces multiple small models from a single large model
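
For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function names, the example sentences, and the baseline figures are hypothetical and are not taken from the paper. This page does not say whether the 12.54% figure is a relative or an absolute reduction, so the helper below shows the relative form only as one possible reading.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement of new_wer over baseline_wer (assumed reading)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion
    # against a six-word reference.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    # Hypothetical WER values, for illustration only.
    print(f"Relative reduction: {relative_wer_reduction(0.20, 0.15):.2%}")
```

Under the relative reading, a drop from a hypothetical baseline WER of 0.20 to 0.15 would count as a 25% reduction; under the absolute reading the same drop would be 5 percentage points. The full paper should be consulted for which convention the authors use.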

Abstract

Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation, or require prolonged training of the smaller models to reach better performance. To address these issues, we introduce an efficacious two-step representation-learning-based approach capable of producing several small-sized models from a single large model, ensuring considerably better performance in a limited number of epochs. Comprehensive experimentation on ASR benchmarks reveals the efficacy of our approach, achieving three-fold training speed-up…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing