Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio
Mateusz Barański, Jan Jasiński, Julitta Bartolewska, Stanisław Kacprzak, Marcin Witkowski, Konrad Kowalczyk

TL;DR
This paper investigates how non-speech sounds can induce hallucinations in the Whisper ASR model, and proposes a post-processing method to mitigate these hallucinations, improving transcription accuracy.
Contribution
It introduces a systematic study of hallucinations caused by non-speech audio in Whisper ASR and develops a bag of hallucinations (BoH) for post-processing to reduce errors.
Findings
Post-processing with BoH reduces word error rate (WER).
Certain non-speech sounds frequently induce hallucinations.
BoH acts as an effective safeguard against hallucinations.
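As a reminder of what the WER metric in these findings measures, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions made for illustration; the paper's exact scoring setup is not given in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because inserted words count as errors, a hallucinated phrase appended to an otherwise correct transcript raises WER even though every spoken word was recognized, which is why removing such phrases can lower the score.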
Abstract
Hallucinations of deep neural models are among the key challenges in automatic speech recognition (ASR). In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that appear frequently. We then study hallucinations caused by the augmentation of speech with such sounds. Finally, we describe the creation of a bag of hallucinations (BoH) that allows the effect of hallucinations to be removed through post-processing of text transcriptions. The results of our experiments show that such post-processing is capable of reducing word error rate (WER) and acts as a good safeguard against problematic hallucinations.
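The post-processing idea described in the abstract can be illustrated with a minimal sketch. It assumes the bag of hallucinations is a plain set of phrases and that cleanup means deleting exact, whole-word matches from a normalized transcript. The names `normalize`, `strip_hallucinations`, and `EXAMPLE_BAG`, and the two phrases in the bag, are hypothetical placeholders, not entries or procedures taken from the paper.

```python
import re

# Hypothetical placeholder entries; a real bag would be collected by
# transcribing non-speech audio and keeping the most frequent outputs.
EXAMPLE_BAG = {
    "thank you for watching",
    "subtitles by the community",
}

def normalize(text: str) -> str:
    """Lowercase, replace punctuation (except apostrophes) with spaces, squeeze whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())

def strip_hallucinations(transcript: str, bag: set[str]) -> str:
    """Remove every whole-word occurrence of a bag phrase from the transcript."""
    cleaned = normalize(transcript)
    # Longest phrases first, so a short entry cannot split a longer one.
    for phrase in sorted({normalize(p) for p in bag}, key=len, reverse=True):
        if not phrase:
            continue
        # Match only when the phrase is bounded by whitespace or string edges.
        pattern = r"(?<!\S)" + re.escape(phrase) + r"(?!\S)"
        cleaned = re.sub(pattern, " ", cleaned)
    return " ".join(cleaned.split())
```

For example, `strip_hallucinations("Turn the lights off. Thank you for watching!", EXAMPLE_BAG)` returns the normalized text `turn the lights off`. Exact matching of this kind would also delete a bag phrase that a speaker genuinely said, so it is best read as a baseline illustration rather than the authors' procedure.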
