TL;DR
This study evaluates the real-world accuracy of eleven popular ASR services under diverse conditions. It finds wide variation in accuracy and lower reliability than claims of near-human performance suggest, especially in streaming scenarios.
Contribution
It provides an independent, comprehensive assessment of ASR accuracy in educational settings, highlighting discrepancies between reported and actual performance.
Findings
Accuracy varies widely across ASR vendors.
Streaming ASR performs significantly worse than offline ASR.
ASR reliability remains a concern despite recent advancements.
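To make "accuracy" concrete: this summary does not state which metric the study uses, but ASR error rates are commonly reported as word error rate (WER), the word-level edit distance between a reference transcript and the ASR output divided by the number of reference words. The following is a minimal sketch under that assumption; the function name `word_error_rate` and the simple lowercase, whitespace-based tokenization are illustrative choices, not the authors' method.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are lowercase, whitespace-separated words. Real
    evaluations usually also normalize punctuation and numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference, which is one reason a single number may not capture how usable a caption is.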
Abstract
For d/Deaf and hard of hearing (DHH) people, captioning is an essential accessibility tool. Significant developments in artificial intelligence (AI) mean that Automatic Speech Recognition (ASR) is now a part of many popular applications. This makes creating captions easy and broadly available, but transcription needs high levels of accuracy to be accessible. Scientific publications and industry report very low error rates, claiming AI has reached human parity or even outperforms manual transcription. At the same time, the DHH community reports serious issues with the accuracy and reliability of ASR. There seems to be a mismatch between technical innovations and the real-life experience of people who depend on transcription. Independent and comprehensive data is needed to capture the state of ASR. We measured the performance of eleven common ASR services with recordings of Higher…
