Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation
Kumud Tripathi, Aditya Srinivas Menon, Aman Gaurav, Raj Prakash Gohil, Pankaj Wasnik

TL;DR
This paper introduces a two-stage approach that combines Adaptive Layer Attention and knowledge distillation to reduce hallucinations in Whisper speech recognition, especially under noisy conditions, while improving robustness and accuracy.
Contribution
It proposes a new architecture that enhances Whisper's robustness through adaptive layer attention and knowledge distillation, directly addressing hallucination errors in noisy environments.
Findings
Significant reduction in hallucinations and word error rates in noisy conditions
Improved robustness of Whisper model without sacrificing clean speech performance
Effective use of multi-objective knowledge distillation for noise robustness
Abstract
The Whisper model, an open-source automatic speech recognition system, is widely adopted for its strong performance across multilingual and zero-shot settings. However, it frequently suffers from hallucination errors, especially under noisy acoustic conditions. Previous works to reduce hallucinations in Whisper-style ASR systems have primarily focused on audio preprocessing or post-processing of transcriptions to filter out erroneous content. However, modifications to the Whisper model itself remain largely unexplored to mitigate hallucinations directly. To address this challenge, we present a two-stage architecture that first enhances encoder robustness through Adaptive Layer Attention (ALA) and further suppresses hallucinations using a multi-objective knowledge distillation (KD) framework. In the first stage, ALA groups encoder layers into semantically coherent blocks via inter-layer…
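The abstract names a multi-objective knowledge distillation framework, but the excerpt is cut off before the individual objectives are described. As a generic illustration only, and not the paper's actual loss, the sketch below shows the common single-token form of distillation: a cross-entropy term against the reference token combined with a temperature-scaled KL term that pulls the student's distribution toward the teacher's. The function names, `temperature`, and `alpha` are all assumptions introduced for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Negative log-likelihood of the reference token under the student."""
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Generic KD objective (hypothetical weighting, not from the paper).

    alpha weights the hard-label term; (1 - alpha) weights the
    teacher-matching term, rescaled by temperature**2 as is conventional.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd_term = kl_divergence(p_teacher, p_student) * temperature ** 2
    ce_term = cross_entropy(student_logits, target_index)
    return alpha * ce_term + (1.0 - alpha) * kd_term
```

In a setup like the one the abstract describes, the teacher logits would plausibly come from a model that hears a cleaner signal than the student does, but that pairing is an assumption here; the truncated text does not confirm it.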
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Hearing Loss and Rehabilitation
