Human and Automatic Speech Recognition Performance on German Oral History Interviews
Michael Gref, Nike Matthiesen, Christoph Schmidt, Sven Behnke, Joachim Köhler

TL;DR
This study compares human and automatic transcription accuracy on German oral history interviews, revealing a significant gap and demonstrating improvements in machine models through adaptation techniques.
Contribution
It provides the first detailed comparison of human and machine transcription performance on German oral history data and explores model adaptation for improved accuracy.
Findings
Human WER estimated at 8.7% for clean interviews
Machine models achieved 23.9% WER on noisy data
Model adaptation improved the acoustic models by 5 to 8% relative
Abstract
Automatic speech recognition systems have accomplished remarkable improvements in transcription accuracy in recent years. On some domains, models now achieve near-human performance. However, transcription performance on oral history has not yet reached human accuracy. In the present work, we investigate how large this gap between human and machine transcription still is. For this purpose, we analyze and compare transcriptions of three humans on a new oral history data set. We estimate a human word error rate of 8.7% for recent German oral history interviews with clean acoustic conditions. For comparison with recent machine transcription accuracy, we present experiments on the adaptation of an acoustic model achieving near-human performance on broadcast speech. We investigate the influence of different adaptation data on robustness and generalization for clean and noisy oral history…
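For readers unfamiliar with the metric, word error rate (WER) is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring pipeline used in the paper. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration; real evaluations usually also normalize case, punctuation, and numerals first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one deleted word ("ein") and one substituted word.
reference = "das ist ein kurzes interview"
hypothesis = "das ist kurzes interviews"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")  # 2 errors over 5 reference words
```

The same function can score a human transcript against a reference just as it scores machine output, which is what makes a human WER and a machine WER directly comparable.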
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
