Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down

Yingzhi Wang; Anas Alhmoud; Saad Alsahly; Muhammad Alqurishi; Mirco Ravanelli

arXiv:2505.12969·cs.CL·May 20, 2025

Open Access

TL;DR

Calm-Whisper reduces Whisper's hallucinations on non-speech segments by over 80% by fine-tuning only the three decoder self-attention heads most responsible for them, with less than 0.1% WER degradation on LibriSpeech.

Contribution

This paper identifies the decoder self-attention heads most responsible for hallucinations on non-speech audio and fine-tunes those heads to mitigate the problem, without any pre- or post-processing.

Findings

01. Over 80% reduction in non-speech hallucinations

02. Less than 0.1% WER degradation on LibriSpeech

03. Only 3 of the 20 self-attention heads in the Whisper-large-v3 decoder account for over 75% of hallucinations on the UrbanSound dataset

Abstract

OpenAI's Whisper has achieved significant success in Automatic Speech Recognition. However, it has consistently been found to exhibit hallucination issues, particularly in non-speech segments, which limits its broader application in complex industrial settings. In this paper, we introduce a novel method to reduce Whisper's hallucination on non-speech segments without using any pre- or post-processing techniques. Specifically, we benchmark the contribution of each self-attention head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our findings reveal that only 3 of the 20 heads account for over 75% of the hallucinations on the UrbanSound dataset. We then fine-tune these three crazy heads using a collection of non-speech data. The results show that our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech…
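The head-wise masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes that "masking" a head means zeroing that head's output slice, it uses a toy causal self-attention with identity projections instead of Whisper's learned weights, and the names `masked_self_attention`, `rank_heads`, and `toy_rate` are hypothetical. The numbers inside `toy_rate` are made up purely to exercise the ranking loop and are not results from the paper.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(x, n_heads, masked_heads=()):
    """Toy causal self-attention with identity Q/K/V projections.

    x is a list of T token vectors of width d (d divisible by n_heads).
    The output slice of every head in masked_heads is left at zero.
    """
    d = len(x[0])
    hd = d // n_heads  # width of one head
    out = [[0.0] * d for _ in x]
    for h in range(n_heads):
        if h in masked_heads:
            continue  # masked head contributes nothing
        lo, hi = h * hd, (h + 1) * hd
        for t in range(len(x)):
            # Scaled dot-product scores against positions 0..t (causal).
            scores = [
                sum(a * b for a, b in zip(x[t][lo:hi], x[s][lo:hi])) / math.sqrt(hd)
                for s in range(t + 1)
            ]
            weights = softmax(scores)
            for j in range(lo, hi):
                out[t][j] = sum(weights[s] * x[s][j] for s in range(t + 1))
    return out


def rank_heads(n_heads, hallucination_rate):
    """Rank heads by how much masking each one lowers the hallucination rate.

    hallucination_rate(masked_heads) is a caller-supplied evaluation that
    returns the fraction of non-speech clips for which the model emits text.
    """
    baseline = hallucination_rate(())
    drop = {h: baseline - hallucination_rate((h,)) for h in range(n_heads)}
    return sorted(drop, key=drop.get, reverse=True), drop


# Masking head 1 of 2 zeroes columns 2..3 and leaves head 0 untouched.
x = [[1.0, 0.0, 0.5, 2.0], [0.0, 1.0, 1.5, -1.0]]
y_full = masked_self_attention(x, n_heads=2)
y = masked_self_attention(x, n_heads=2, masked_heads=(1,))


def toy_rate(masked):
    # Made-up stand-in for a real evaluation; heads 2 and 5 are arbitrary.
    return 0.9 - 0.2 * sum(1 for h in masked if h in (2, 5))


ranking, drop = rank_heads(8, toy_rate)
```

In a real experiment, `hallucination_rate` would run the model over a non-speech set such as UrbanSound with the given heads masked. Whether a head index is masked in every decoder layer at once or layer by layer is not stated in this excerpt, so that choice is left to the caller.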

Taxonomy

Topics: Speech and Audio Processing · Emotion and Mood Recognition · Adversarial Robustness in Machine Learning