Updated Corpora and Benchmarks for Long-Form Speech Recognition
Jennifer Drexler Fox, Desh Raj, Natalie Delworth, Quinn McNamara, Corey Miller, Migüel Jetté

TL;DR
This paper updates popular ASR corpora for long-form speech recognition, investigates the train-test mismatch between segmented training data and unsegmented test audio (which affects attention-based encoder-decoders, or AEDs, more than transducers), and demonstrates that long-form training improves model robustness in real-world scenarios.
Contribution
It re-releases key ASR datasets with updated annotations for long-form speech research and benchmarks training methods to address domain shift issues.
Findings
AEDs are more affected by train-test mismatch than transducers.
Long-form training improves robustness to domain shift.
Updated corpora facilitate research on real-world long-form speech recognition.
Abstract
The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora (TED-LIUM 3, GigaSpeech, and VoxPopuli-en) with updated transcriptions and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training scheme for these models, showing its efficacy for model robustness under this domain shift.
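To make the idea of long-form training data concrete, one common way to build longer examples from a segmented corpus is to greedily merge consecutive utterances of the same recording up to a maximum duration. The sketch below illustrates that general idea only; it is not necessarily the procedure used in the paper. The `Segment` type, the `merge_segments` function, and the 30-second default are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcribed span of a recording (hypothetical structure)."""
    recording_id: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def merge_segments(segments: List[Segment], max_duration: float = 30.0) -> List[Segment]:
    """Greedily merge consecutive segments of a recording into longer chunks.

    A segment is appended to the current chunk when it belongs to the same
    recording and the merged chunk would span at most `max_duration` seconds.
    Otherwise it starts a new chunk. The 30 s default is an assumption.
    """
    merged: List[Segment] = []
    ordered = sorted(segments, key=lambda s: (s.recording_id, s.start))
    for seg in ordered:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.recording_id == seg.recording_id
            and seg.end - last.start <= max_duration
        ):
            # Extend the current chunk and concatenate the transcripts.
            merged[-1] = Segment(
                last.recording_id,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            merged.append(Segment(seg.recording_id, seg.start, seg.end, seg.text))
    return merged
```

Note that a merged chunk spans any silence or untranscribed audio between its segments, which is why complete transcriptions and alignments of the full recordings matter for this kind of training data.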
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
