Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training
Lillian Zhou, Dhruv Guliani, Andreas Kabel, Giovanni Motta, Françoise Beaufays

TL;DR
This paper studies the layer structure of Conformer ASR models, showing that layers differ in importance and examining how stable that pattern is across runs and model sizes. It then applies these insights to make federated training more efficient without degrading recognition quality.
Contribution
It shows that Conformer layers are heterogeneous in importance, examines whether group normalization preserves that structure, and targets Federated Dropout to layers according to their importance for more efficient training.
Findings
State-of-the-art Conformer models generally contain multiple ambient layers.
Group normalization may be used without disrupting the formation of ambient layers.
Targeting Federated Dropout to layers by importance reduces the model size optimized by clients without quality degradation.
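The summary does not spell out how layer importance is measured. One probe used in the layer-importance literature is to reset a single layer to its initial weights and see how much the loss degrades; layers that tolerate the reset are called ambient. The sketch below illustrates that idea on a toy residual stack with one scalar weight per layer. The model, the function names (`forward`, `reinit_sensitivity`), and all numbers are hypothetical and are not taken from the paper.

```python
import math

def forward(weights, xs):
    """Toy residual stack with one scalar weight per layer.
    This is a stand-in for a real encoder, not a Conformer."""
    outs = []
    for x in xs:
        h = x
        for w in weights:
            h = h + math.tanh(w * h)
        outs.append(h)
    return outs

def reinit_sensitivity(trained, initial, xs, loss):
    """Reset one layer at a time to its initial weight and return,
    per layer, the loss increase over the fully trained model."""
    baseline = loss(forward(trained, xs))
    deltas = []
    for i in range(len(trained)):
        probed = list(trained)
        probed[i] = initial[i]  # only layer i is rolled back
        deltas.append(loss(forward(probed, xs)) - baseline)
    return deltas

# Hypothetical weights: layer 1 barely moved during "training".
initial = [0.10, 0.20, -0.10]
trained = [0.90, 0.21, -0.80]
xs = [-1.0, -0.5, 0.5, 1.0]

# Use the trained model's own outputs as targets, so the baseline loss is 0.
target = forward(trained, xs)

def mse(out):
    return sum((o - t) ** 2 for o, t in zip(out, target)) / len(out)

deltas = reinit_sensitivity(trained, initial, xs, mse)
# The layer with the smallest loss increase is the most "ambient" one here.
ambient = min(range(len(deltas)), key=deltas.__getitem__)
print(deltas, ambient)
```

In a real ASR setting the loss would be replaced by an evaluation metric such as word error rate on a held-out set, and each layer would be a full Conformer block rather than a scalar.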
Abstract
Transformer-based architectures have been the subject of research aimed at understanding their overparameterization and the non-uniform importance of their layers. Applying these approaches to Automatic Speech Recognition, we demonstrate that the state-of-the-art Conformer models generally have multiple ambient layers. We study the stability of these layers across runs and model sizes, propose that group normalization may be used without disrupting their formation, and examine their correlation with model weight updates in each layer. Finally, we apply these findings to Federated Learning in order to improve the training procedure, by targeting Federated Dropout to layers by importance. This allows us to reduce the model size optimized by clients without quality degradation, and shows potential for future exploration.
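As a rough illustration of what layer-wise Federated Dropout can look like, the sketch below gives each layer its own drop rate, samples the units a client keeps, slices out the smaller sub-matrix, and maps the client's update back onto the full server weights. It is a minimal sketch under stated assumptions: the per-layer rates are hypothetical (the abstract does not say how importance maps to rates), only rows are dropped for brevity, and none of the function names come from the paper.

```python
import random

def sample_kept_units(layer_widths, drop_rates, seed=0):
    """Pick, per layer, the unit indices a client keeps this round.
    drop_rates holds one rate per layer (hypothetical values)."""
    rng = random.Random(seed)
    kept = []
    for width, rate in zip(layer_widths, drop_rates):
        n_keep = max(1, round(width * (1.0 - rate)))
        kept.append(sorted(rng.sample(range(width), n_keep)))
    return kept

def extract_submatrix(weight, kept_rows):
    """Slice out the rows (units) the client will train.
    Simplification: columns are left intact."""
    return [weight[r][:] for r in kept_rows]

def scatter_update(weight, kept_rows, sub_update):
    """Add a client's sub-model update back into a copy of the
    full server matrix; dropped rows stay unchanged."""
    new = [row[:] for row in weight]
    for r, delta in zip(kept_rows, sub_update):
        new[r] = [w + d for w, d in zip(new[r], delta)]
    return new

# Hypothetical three-layer model, 8 units per layer.
widths = [8, 8, 8]
rates = [0.0, 0.5, 0.25]  # assumed per-layer drop rates
kept = sample_kept_units(widths, rates, seed=1)

full = [[0.0, 0.0] for _ in range(8)]        # server weights for layer 1
sub = extract_submatrix(full, kept[1])       # what the client receives
client_update = [[1.0, 1.0] for _ in sub]    # stand-in for local training
merged = scatter_update(full, kept[1], client_update)

# Fraction of units the client optimizes across all layers.
fraction_trained = sum(len(k) for k in kept) / sum(widths)
print([len(k) for k in kept], fraction_trained)
```

Keeping the rates as an explicit per-layer list makes it easy to plug in whatever importance-to-rate mapping is being studied without touching the slicing and merging logic.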
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Group Normalization · Dropout
