Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild
Berkin Durmus, Chen Cen, Eduardo Pacheco, Arda Okan, Atila Orhon

TL;DR
This paper introduces Contextual Earnings-22, a new benchmark dataset for speech recognition with custom vocabulary, highlighting the importance of contextual conditioning in real-world applications.
Contribution
It provides a standardized benchmark dataset with realistic custom vocabulary contexts and evaluates strong baseline approaches for contextual speech recognition.
Findings
Both keyword prompting and boosting approaches improve accuracy significantly at scale.
Experiments demonstrate comparable performance of the two approaches on the new benchmark.
The benchmark reveals latent progress potential in contextual speech-to-text systems.
Abstract
The accuracy frontier of speech-to-text systems has plateaued on academic benchmarks. In contrast, industrial benchmarks and adoption in high-stakes domains suggest otherwise. We hypothesize that the primary difference between the two is contextual conditioning: Academic benchmarks are dominated by frequently encountered general vocabulary that is relatively easy to recognize compared with rare and context-defined custom vocabulary that has disproportionate impact on the usability of speech transcripts. Despite progress on contextual speech-to-text, there is no standardized benchmark. We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts to foster research and reveal latent progress. We set six strong baselines for two dominant approaches: keyword prompting and keyword boosting. Experiments show both reach comparable and…
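To make the idea of keyword boosting concrete, here is a toy sketch in Python. It is not one of the paper's six baselines, and it makes several assumptions: the function name `boost_rescore`, the n-best list format (text paired with a log-score), the fixed per-keyword `bonus`, and the example transcripts are all hypothetical. The sketch only shows the general principle of raising the score of hypotheses that contain terms from a custom vocabulary.

```python
from typing import List, Tuple


def boost_rescore(
    nbest: List[Tuple[str, float]],
    keywords: List[str],
    bonus: float = 2.0,
) -> str:
    """Pick the hypothesis with the highest score after keyword boosting.

    Each hypothesis is a (text, log_score) pair. Every custom-vocabulary
    keyword found in the text (case-insensitive substring match) adds
    `bonus` to that hypothesis's score. All names here are illustrative.
    """

    def boosted(item: Tuple[str, float]) -> float:
        text, score = item
        lowered = text.lower()
        hits = sum(1 for kw in keywords if kw.lower() in lowered)
        return score + bonus * hits

    return max(nbest, key=boosted)[0]


# Hypothetical n-best list: the recognizer slightly prefers the wrong spelling.
nbest = [
    ("our ebit dot margin improved", -4.0),
    ("our EBITDA margin improved", -5.0),
]

best_with_context = boost_rescore(nbest, ["EBITDA"])
best_without_context = boost_rescore(nbest, [])
```

With the custom vocabulary supplied, the second hypothesis wins because its boosted score (-5.0 + 2.0) exceeds -4.0. With an empty vocabulary, the unboosted top hypothesis is returned. Keyword prompting, by contrast, would pass the vocabulary to the model as conditioning text rather than adjusting scores after the fact.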
