Avoid Overthinking in Self-Supervised Models for Speech Recognition
Dan Berrebbi, Brian Yan, Shinji Watanabe

TL;DR
This paper investigates overthinking in self-supervised speech recognition models, demonstrating the issue and proposing two novel early exit strategies to improve inference efficiency and accuracy, especially on out-of-distribution data.
Contribution
It introduces the first analysis of overthinking in SSL-based ASR and proposes two new early exit strategies tailored for speech recognition tasks.
Findings
SSL models exhibit overthinking in ASR.
Proposed strategies outperform previous early exit methods.
Optimal performance-speed trade-off bounds are computed.
Abstract
Self-supervised learning (SSL) models have reshaped our approach to speech, language and vision. However, their huge size and the opaque relations between their layers and tasks result in slow inference and network overthinking, where predictions made from the last layer of large models are worse than those made from intermediate layers. Early exit (EE) strategies can solve both issues by dynamically reducing computations at inference time for certain samples. Although popular for classification tasks in vision and language, EE has seen less use for sequence-to-sequence speech recognition (ASR) tasks, where outputs from early layers are often degenerate. This challenge is further compounded when speech SSL models are applied to out-of-distribution (OOD) data. This paper first shows that SSL models do overthink in ASR. We then motivate further research in EE by computing an optimal bound for…
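To make the early exit idea concrete, here is a minimal sketch of a generic entropy-threshold exit rule. It is not one of the paper's two proposed strategies, which this summary does not describe. The names `mean_entropy`, `early_exit`, `TOY_LAYERS` and the threshold value are illustrative assumptions, and the per-layer posteriors are made-up numbers standing in for the frame-level outputs an ASR head attached to each encoder layer might produce.

```python
import math
from typing import Iterable, List, Tuple

Posteriors = List[List[float]]  # one probability distribution per frame


def mean_entropy(posteriors: Posteriors) -> float:
    """Average per-frame entropy, in nats, of the given distributions."""
    total = 0.0
    for probs in posteriors:
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(posteriors)


def early_exit(
    layer_posteriors: Iterable[Posteriors], threshold: float
) -> Tuple[int, Posteriors]:
    """Stop at the first layer whose mean entropy is below `threshold`.

    Falls back to the last layer if no layer is confident enough.
    """
    exit_index, chosen = -1, None
    for exit_index, chosen in enumerate(layer_posteriors):
        if mean_entropy(chosen) < threshold:
            break  # later layers are never requested from the iterable
    if chosen is None:
        raise ValueError("layer_posteriors is empty")
    return exit_index, chosen


# Hypothetical toy posteriors: 3 layers, 2 frames, 3 output tokens.
# The middle layer is the most confident; the last one is less so,
# loosely mimicking the overthinking pattern described above.
TOY_LAYERS: List[Posteriors] = [
    [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]],
    [[0.97, 0.02, 0.01], [0.90, 0.05, 0.05]],
    [[0.60, 0.20, 0.20], [0.50, 0.25, 0.25]],
]

exit_index, _ = early_exit(iter(TOY_LAYERS), threshold=0.5)
print("exited at layer index", exit_index)
```

Passing an iterator or generator means layers after the exit point are never computed, which is where the inference savings would come from. In a real system the threshold would need tuning, and low entropy does not guarantee a correct transcript.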
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
