ASR Error Detection via Audio-Transcript entailment

Nimshi Venkat Meripo; Sandeep Konam

arXiv:2207.10849·cs.CL·July 25, 2022

TL;DR

This paper introduces an end-to-end audio-transcript entailment model for detecting ASR errors, with a focus on medical conversations; it improves on the baseline by 12% on all errors and 15.4% on medical errors.

Contribution

To the best of the authors' knowledge, it is the first work to frame ASR error detection as an end-to-end entailment task between an audio segment and its corresponding transcript segment, combining an acoustic encoder with a linguistic encoder.
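
The entailment framing can be illustrated with a minimal Python sketch: each (audio segment, transcript segment) pair is encoded by two separate encoders, the two representations are fused, and a binary classifier decides whether the transcript is entailed by the audio. Everything concrete here is an assumption made for illustration and not the authors' implementation: `encode_audio` and `encode_text` are toy stand-ins for the acoustic and linguistic encoders, fusion by concatenation and the logistic output layer are placeholders, and the names `entailment_probability` and `has_asr_error` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def encode_audio(samples: Sequence[float], dim: int = 4) -> Vector:
    """Toy stand-in for an acoustic encoder (assumption, not the paper's model).

    Splits the waveform into `dim` equal-length chunks and returns the sum of
    each chunk divided by the chunk length.
    """
    chunk = max(1, len(samples) // dim)
    return [sum(samples[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]


def encode_text(tokens: Sequence[str], dim: int = 4) -> Vector:
    """Toy stand-in for a linguistic encoder (assumption, not the paper's model).

    Builds hashed bag-of-words counts with a deterministic character-sum hash.
    """
    vec = [0.0] * dim
    for tok in tokens:
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return vec


def entailment_probability(audio_vec: Vector, text_vec: Vector,
                           weights: Sequence[float], bias: float) -> float:
    """Fuse both views by concatenation (assumed) and apply a logistic layer.

    Returns the modeled probability that the transcript matches the audio.
    """
    fused = list(audio_vec) + list(text_vec)
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def has_asr_error(samples: Sequence[float], tokens: Sequence[str],
                  weights: Sequence[float], bias: float,
                  threshold: float = 0.5) -> bool:
    """Flag a segment as erroneous when the entailment probability is low."""
    p = entailment_probability(encode_audio(samples), encode_text(tokens),
                               weights, bias)
    return p < threshold
```

In a trained system the weights would be learned from labeled segment pairs; the sketch only shows how error detection reduces to a binary decision over a joint audio-text representation.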

Findings

01

Achieved a classification error rate (CER) of 26.2% on all transcription errors and 23% on medical errors

02

Improved on the baseline by 12% (all errors) and 15.4% (medical errors)

03

Effective in medical domain error detection
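
As a reading aid for the figures above, here is a small Python sketch that assumes CER denotes a classification error rate, that is, the fraction of segments whose error/no-error label is predicted incorrectly (not the character error rate often abbreviated the same way in ASR work). The function name and the toy labels are hypothetical and are not taken from the paper's data.

```python
from typing import Sequence


def classification_error_rate(predicted: Sequence[int],
                              actual: Sequence[int]) -> float:
    """Fraction of segments whose predicted label differs from the true label.

    Labels are assumed to be 1 for "segment contains an ASR error" and 0 otherwise.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one segment is required")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


# Toy example: one of four segments is misclassified.
rate = classification_error_rate([1, 0, 1, 1], [1, 0, 0, 1])
print(f"{rate:.1%}")  # prints 25.0%
```

Under this reading, a lower value is better, so a CER of 26.2% means roughly one in four segments is labeled incorrectly.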

Abstract

Despite improved performances of the latest Automatic Speech Recognition (ASR) systems, transcription errors are still unavoidable. These errors can have a considerable impact in critical domains such as healthcare, when used to help with clinical documentation. Therefore, detecting ASR errors is a critical first step in preventing further error propagation to downstream applications. To this end, we propose a novel end-to-end approach for ASR error detection using audio-transcript entailment. To the best of our knowledge, we are the first to frame this problem as an end-to-end entailment task between the audio segment and its corresponding transcript segment. Our intuition is that there should be a bidirectional entailment between audio and transcript when there is no recognition error and vice versa. The proposed model utilizes an acoustic encoder and a linguistic encoder to model the…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling