Don't Be So Sure! Boosting ASR Decoding via Confidence Relaxation
Tomer Wullach, Shlomo E. Chazan

TL;DR
This paper introduces an ASR decoding method that relaxes model confidence and aggregates information from multiple layers. It improves recognition performance, especially in low-resource settings, without extra training or parameters.
Contribution
It proposes a confidence relaxation and layer aggregation technique for ASR decoding that enhances performance without additional training or model complexity.
Findings
Improves ASR decoding accuracy across various resource levels.
Reduces inference computation compared to existing methods.
Shows consistent gains, especially in low-resource scenarios.
Abstract
Automatic Speech Recognition (ASR) systems frequently use a search-based decoding strategy aiming to find the best attainable transcript by considering multiple candidates. One prominent speech recognition decoding heuristic is beam search, which seeks the transcript with the greatest likelihood computed using the predicted distribution. While showing substantial performance gains in various tasks, beam search loses some of its effectiveness when the predicted probabilities are highly confident, i.e., the predicted distribution is massed for a single or very few classes. We show that recently proposed Self-Supervised Learning (SSL)-based ASR models tend to yield exceptionally confident predictions that may hamper beam search from truly considering a diverse set of candidates. We perform a layer analysis to reveal and visualize how predictions evolve, and propose a decoding procedure…
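The effect described in the abstract can be illustrated with a small sketch. It is not the authors' procedure: it assumes that confidence relaxation can be approximated by temperature scaling and that layer aggregation can be approximated by a uniform average of per-layer logits. The logit values, the function names (`softmax`, `entropy`, `aggregate_layers`), and the temperature of 2.0 are all hypothetical, chosen only to show how a peaked distribution leaves beam search with little to choose from.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a temperature above 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; low entropy means a highly confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def aggregate_layers(layer_logits, weights=None):
    """Weighted average of per-layer logits for one frame (uniform by default).

    Assumption: plain logit averaging stands in for whatever aggregation
    the paper actually uses.
    """
    n_layers = len(layer_logits)
    if weights is None:
        weights = [1.0 / n_layers] * n_layers
    vocab_size = len(layer_logits[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, layer_logits))
        for i in range(vocab_size)
    ]


# Hypothetical logits for a single frame over a 4-token vocabulary.
final_layer = [12.0, 2.0, 1.0, 0.5]   # very confident top layer
lower_layer = [6.0, 5.0, 1.0, 0.5]    # less certain intermediate layer

sharp = softmax(final_layer)
relaxed = softmax(aggregate_layers([lower_layer, final_layer]), temperature=2.0)

print("sharp:  ", [round(p, 4) for p in sharp], "entropy", round(entropy(sharp), 4))
print("relaxed:", [round(p, 4) for p in relaxed], "entropy", round(entropy(relaxed), 4))
```

In this toy case the top-ranked token is unchanged, but the relaxed distribution assigns noticeably more mass to the runner-up tokens, so a beam search scoring candidates with it has alternatives whose scores are not negligible.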
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
