Updated Corpora and Benchmarks for Long-Form Speech Recognition

Jennifer Drexler Fox; Desh Raj; Natalie Delworth; Quinn McNamara; Corey Miller; Migüel Jetté

arXiv:2309.15013·cs.CL·September 27, 2023

TL;DR

This paper updates three popular ASR corpora for long-form speech recognition, investigates the train-test mismatch between segmented training data and unsegmented test audio, which affects attention-based encoder-decoders (AEDs) more than transducers, and shows that long-form training improves model robustness to this mismatch.

Contribution

It re-releases key ASR datasets with updated annotations for long-form speech research and benchmarks training methods to address domain shift issues.

Findings

01 AEDs are more affected by train-test mismatch than transducers.

02 Long-form training improves robustness to domain shift.

03 Updated corpora facilitate research on real-world long-form speech recognition.

Abstract

The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora (TED-LIUM 3, GigaSpeech, and VoxPopuli-en) with updated transcriptions and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training method for these models, showing its efficacy for model robustness under this domain shift.
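To make the idea of long-form training data concrete, the sketch below shows one way pre-segmented utterances could be merged back into longer training examples. It is an illustration only and not the authors' procedure: the `Segment` schema, the greedy merging rule, and the 30-second `max_duration` default are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One pre-segmented utterance (hypothetical schema, not from the paper)."""
    recording_id: str
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def merge_into_long_form(segments: List[Segment], max_duration: float = 30.0) -> List[Segment]:
    """Greedily merge consecutive segments of the same recording.

    A segment is appended to the current chunk as long as the chunk, measured
    from its first start time to the new end time, stays within max_duration
    seconds. The 30-second default is an illustrative assumption.
    """
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: (s.recording_id, s.start)):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.recording_id == seg.recording_id
            and seg.end - last.start <= max_duration
        ):
            # Extend the current chunk and concatenate the transcripts.
            merged[-1] = Segment(
                last.recording_id, last.start, seg.end, f"{last.text} {seg.text}"
            )
        else:
            # Start a new chunk (new recording, or the duration budget is exceeded).
            merged.append(seg)
    return merged
```

Under these assumptions, segments from the same recording are joined until the duration budget is reached, so a model trained on the output sees inputs that span several original utterances, including the pauses between them.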

Code & Models

Repositories

revdotcom/speech-datasets (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems