Comparing Human and Machine Errors in Conversational Speech Transcription
Andreas Stolcke, Jasha Droppo

TL;DR
This paper compares human and machine transcription errors in conversational speech, revealing overlapping error types but also specific differences, and assesses how well humans can distinguish between the two sources.
Contribution
It systematically analyzes and quantifies differences between human and machine errors in conversational speech transcription, including error overlap and distinguishability.
Findings
The most frequent substitution, deletion, and insertion error types show a high degree of overlap between human and machine transcriptions.
The automatic recognizer tends to confuse filled pauses ("uh") with backchannel acknowledgments ("uhhuh"); human transcribers tend not to make this error.
An informal test assesses how well humans can tell human transcription errors from machine transcription errors.
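The error categories named in the findings can be illustrated with a small sketch. It aligns a reference transcript against a hypothesis by minimum edit distance, labels each mismatch as a substitution, deletion, or insertion, and then intersects the most frequent error types from two sources. This is not the scoring tool used in the paper; the function names `align_errors` and `top_error_overlap` are hypothetical, and the toy sentences are invented for illustration rather than taken from the evaluation data.

```python
from collections import Counter


def align_errors(ref, hyp):
    """Return (kind, ref_word, hyp_word) errors from a minimum edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the error labels.
    errors = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                errors.append(("sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            errors.append(("del", ref[i - 1], None))  # reference word missing from hypothesis
            i -= 1
        else:
            errors.append(("ins", None, hyp[j - 1]))  # extra word in hypothesis
            j -= 1
    errors.reverse()
    return errors


def top_error_overlap(errors_a, errors_b, k=10):
    """Return the error types that appear in the top-k lists of both sources."""
    top_a = {err for err, _ in Counter(errors_a).most_common(k)}
    top_b = {err for err, _ in Counter(errors_b).most_common(k)}
    return top_a & top_b


# Toy example (invented): one reference, one machine output, one human output.
reference = "uh i think so".split()
machine = "uhhuh i think".split()
human = "i think so".split()

machine_errors = align_errors(reference, machine)
human_errors = align_errors(reference, human)
print(machine_errors)
print(human_errors)
print(top_error_overlap(machine_errors, human_errors))
```

With real data, the error lists would be accumulated over every utterance in the test set before taking the top-k types, so that the overlap reflects corpus-level frequencies rather than a single sentence.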
Abstract
Recent work in automatic recognition of conversational telephone speech (CTS) has achieved accuracy levels comparable to human transcribers on the NIST 2000 CTS evaluation set, although there is some debate over how to precisely quantify human performance on this task. This raises the question of what systematic differences, if any, distinguish human from machine transcription errors. In this paper we approach this question by comparing the output of our most accurate CTS recognition system to that of a standard speech transcription vendor pipeline. We find that the most frequent substitution, deletion, and insertion error types of both outputs show a high degree of overlap. The only notable exception is that the automatic recognizer tends to confuse filled pauses ("uh") and backchannel acknowledgments ("uhhuh"). Humans tend not to make this error, presumably due to the…
