Adopting Whisper for Confidence Estimation
Vaibhav Aggarwal, Shabari S Nair, Yash Verma, Yash Jogi

TL;DR
This paper introduces a novel end-to-end method that fine-tunes the Whisper speech recognition model to generate word-level confidence scores. It outperforms lightweight Confidence Estimation Modules (CEMs), especially on out-of-domain datasets.
Contribution
It presents a fine-tuning approach in which Whisper takes an audio input and its hypothesis transcript and outputs word-level confidence scores, and it reports performance that matches or exceeds a strong CEM baseline across in-domain and out-of-domain datasets.
Findings
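Fine-tuned Whisper-tiny, comparable in size to a strong CEM baseline, performs similarly to it on the in-domain dataset.
Fine-tuned Whisper-tiny surpasses the CEM baseline on eight out-of-domain datasets.
Fine-tuned Whisper-large outperforms the CEM baseline on all datasets.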
Abstract
Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. Specifically, we introduce a method in which the Whisper model is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the CEM baseline on eight out-of-domain datasets, whereas the fine-tuned Whisper-large model consistently outperforms the CEM…
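The abstract does not say how training targets or the loss are defined. As a rough illustration only, the sketch below assumes a common setup for word-level confidence estimation: each hypothesis word gets a binary target (1 if it aligns to an identical reference word, 0 otherwise), and the model's scalar scores are trained against those targets with binary cross-entropy. The function names, the use of Python's `difflib` for alignment, and the example scores are all hypothetical and are not taken from the paper.

```python
from __future__ import annotations

import math
from difflib import SequenceMatcher


def word_correctness_labels(reference: str, hypothesis: str) -> list[int]:
    """Label each hypothesis word 1 if it aligns to an identical reference word, else 0.

    Assumption: alignment is done with difflib's matching blocks, which is a
    stand-in for whatever alignment the authors actually use.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    labels = [0] * len(hyp_words)
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        # block.b is the start index in the hypothesis, block.size the run length.
        for j in range(block.b, block.b + block.size):
            labels[j] = 1
    return labels


def binary_cross_entropy(scores: list[float], labels: list[int]) -> float:
    """Mean binary cross-entropy between predicted confidences and 0/1 labels."""
    eps = 1e-7  # clamp to avoid log(0)
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


reference = "the cat sat on the mat"
hypothesis = "the cat sat in the hat"
labels = word_correctness_labels(reference, hypothesis)

# Hypothetical model outputs: one scalar confidence per hypothesis word.
scores = [0.9, 0.9, 0.8, 0.3, 0.9, 0.2]
loss = binary_cross_entropy(scores, labels)
print(labels, round(loss, 4))
```

In this toy example the substituted words "in" and "hat" receive target 0, so a well-calibrated model would assign them low confidence, as the hypothetical scores do.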
Taxonomy
Topics: Adversarial Robustness in Machine Learning
