Training dynamic models using early exits for automatic speech recognition on resource-constrained devices
George August Wright, Umberto Cappellazzo, Salah Zaiem, Desh Raj, Lucas Ondel Yang, Daniele Falavigna, Mohamed Nabih Ali, Alessio Brutti

TL;DR
This paper investigates training strategies for early-exit architectures in self-attention-based automatic speech recognition models. It demonstrates that training from scratch improves performance and explores exit selection based on posterior probabilities.
Contribution
The paper compares fine-tuning pre-trained models with training from scratch for early-exit ASR models, showing that training from scratch improves recognition performance.
Findings
Early-exit models trained from scratch outperform fine-tuned models.
Scratch-trained models maintain performance with fewer encoder layers.
Posterior probability-based exit selection is effective.
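As a rough illustration of posterior-based exit selection, the sketch below stops at the first exit whose average frame-level maximum posterior reaches a threshold. The confidence measure, the threshold value, and the names `mean_max_posterior` and `select_exit` are assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence

# Posteriors for one exit: one probability vector per frame.
Posteriors = Sequence[Sequence[float]]


def mean_max_posterior(posteriors: Posteriors) -> float:
    """Average, over frames, of the highest class probability in each frame."""
    return sum(max(frame) for frame in posteriors) / len(posteriors)


def select_exit(exit_posteriors: Sequence[Posteriors], threshold: float = 0.9) -> int:
    """Return the index of the first exit that is confident enough.

    Falls back to the last exit when no exit reaches the threshold
    (an assumed policy for this sketch).
    """
    for index, posteriors in enumerate(exit_posteriors):
        if mean_max_posterior(posteriors) >= threshold:
            return index
    return len(exit_posteriors) - 1


# Toy data: three exits, two frames each, three output classes.
exits = [
    [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]],
    [[0.8, 0.1, 0.1], [0.9, 0.05, 0.05]],
    [[0.95, 0.03, 0.02], [0.97, 0.02, 0.01]],
]
chosen = select_exit(exits, threshold=0.8)
```

In a real system the encoder would be evaluated layer by layer and stopped as soon as an exit is selected, so later layers are never computed. Here all posteriors are precomputed only to keep the example short.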
Abstract
The ability to dynamically adjust the computational load of neural models during inference is crucial for on-device processing scenarios characterised by limited and time-varying computational resources. A promising solution is presented by early-exit architectures, in which additional exit branches are appended to intermediate layers of the encoder. In self-attention models for automatic speech recognition (ASR), early-exit architectures enable the development of dynamic models capable of adapting their size and architecture to varying levels of computational resources and ASR performance demands. Previous research on early-exiting ASR models has relied on pre-trained self-supervised models, fine-tuned with an early-exit loss. In this paper, we undertake an experimental comparison between fine-tuning pre-trained backbones and training models from scratch with the early-exiting…
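The early-exit loss mentioned in the abstract can be pictured as a combination of the losses computed at each exit branch. The sketch below is a simplified assumption: it uses a weighted sum and replaces the sequence-level ASR loss with a frame-level negative log-likelihood, and the names `frame_nll` and `early_exit_loss` are hypothetical.

```python
import math
from typing import Optional, Sequence

# Posteriors for one exit: one probability vector per frame.
Posteriors = Sequence[Sequence[float]]


def frame_nll(posteriors: Posteriors, targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target class over frames.

    Stand-in for a sequence-level ASR loss; chosen only to keep the sketch small.
    """
    total = sum(math.log(frame[target]) for frame, target in zip(posteriors, targets))
    return -total / len(targets)


def early_exit_loss(
    exit_posteriors: Sequence[Posteriors],
    targets: Sequence[int],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of the per-exit losses (equal weights by default, an assumption)."""
    if weights is None:
        weights = [1.0] * len(exit_posteriors)
    return sum(w * frame_nll(p, targets) for w, p in zip(weights, exit_posteriors))


# Toy data: two exits, one frame, two classes; the target class is 0.
loss = early_exit_loss([[[0.5, 0.5]], [[1.0, 0.0]]], targets=[0])
```

Because every exit contributes to the total, gradients reach the intermediate layers directly, which is what allows a shallower sub-network to produce usable output on its own.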
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
