Hallucination of speech recognition errors with sequence to sequence learning
Prashant Serai, Vishal Sunder, Eric Fosler-Lussier

TL;DR
This paper introduces end-to-end models that hallucinate ASR output word sequences directly from an input word sequence and its corresponding phoneme sequence, improving error simulation and the robustness of downstream spoken language understanding tasks.
Contribution
It presents novel sequence-to-sequence models for hallucinating ASR errors, surpassing prior phonetic-based methods and enabling better data augmentation for robustness.
Findings
Improved recall of hallucinated errors for in-domain and out-of-domain ASR systems.
Enhanced robustness of spoken question classifiers using hallucinated errors for training.
Effective error simulation even with limited test data.
Abstract
Automatic Speech Recognition (ASR) is an imperfect process that results in certain mismatches in ASR output text when compared to plain written text or transcriptions. When plain text data is to be used to train systems for spoken language understanding or ASR, a proven strategy to reduce this mismatch and prevent degradation is to hallucinate what the ASR outputs would be given a gold transcription. Prior work in this domain has focused on modeling errors at the phonetic level, using a lexicon to convert the phones to words, usually accompanied by an FST language model. We present novel end-to-end models to directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence. This improves on prior published results for recall of errors from an in-domain ASR system's transcription of unseen data, as well as an…
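To make the "mismatch" between a gold transcription and ASR output concrete, the sketch below aligns the two word sequences and lists the spans where they differ. It is only an illustration under stated assumptions, not the authors' method: the function name `error_spans` is hypothetical, and Python's `difflib.SequenceMatcher` stands in for whatever alignment the paper actually uses, so the spans it returns may differ from a minimum-edit-distance alignment.

```python
from difflib import SequenceMatcher


def error_spans(gold: str, asr: str):
    """Return (tag, gold_words, asr_words) for every span where the ASR
    output differs from the gold transcription.

    Hypothetical helper for illustration only; tags come from difflib
    and are one of "replace", "delete", or "insert".
    """
    gold_words = gold.lower().split()
    asr_words = asr.lower().split()
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent tokens in long sequences.
    matcher = SequenceMatcher(a=gold_words, b=asr_words, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((tag, gold_words[i1:i2], asr_words[j1:j2]))
    return spans


if __name__ == "__main__":
    # Made-up example pair, not data from the paper.
    for span in error_spans("recognize speech today", "wreck a nice beach today"):
        print(span)
```

Pairs of gold and erroneous spans of this kind are the sort of supervision an error-hallucination model could plausibly learn from, and comparing predicted spans against observed ones is one way a recall-of-errors figure could be computed.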
