Beyond the Utterance: An Empirical Study of Very Long Context Speech Recognition
Robert Flynn, Anton Ragni

TL;DR
This study shows that modern attention-based speech recognition models can be trained on audio sequences of over an hour in length, and that using extended context improves recognition performance.
Contribution
The paper shows that current hardware and algorithms make it possible to train ASR models on sequences of over an hour in length, removing the need to segment long recordings into short utterances first.
Findings
Using up to 21.8 minutes of context is beneficial, with up to a 14.2% relative improvement over a short-context baseline.
The method of encoding positional information and the model architecture are both important for long-sequence performance.
Models utilize both linguistic and acoustic information from distant context.
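As a rough illustration of how a relative improvement of this kind is computed, the sketch below assumes the 14.2% figure is a relative reduction in word error rate (WER). That reading is an assumption, since the summary above does not name the metric. The function name and the WER values are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative error-rate reduction as a percentage of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values chosen only for illustration, not taken from the paper.
baseline = 10.0   # WER (%) of a short-context model
long_ctx = 8.58   # WER (%) of a long-context model
print(round(relative_improvement(baseline, long_ctx), 1))
```

With these made-up values the relative improvement comes to about 14.2%, even though the absolute difference is only 1.42 WER points.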
Abstract
Automatic speech recognition (ASR) models are normally trained to operate over single utterances, with a short duration of less than 30 seconds. This choice has been made in part due to computational constraints, but also reflects a common, but often inaccurate, modelling assumption that treats utterances as independent and identically distributed samples. When long-format audio recordings are available, to work with such systems, these recordings must first be segmented into short utterances and processed independently. In this work, we show that due to recent algorithmic and hardware advances, this is no longer necessary, and current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. Therefore, to gain a better understanding of the relationship between the training/evaluation sequence length and performance, we train ASR…
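To make the segmentation step described in the abstract concrete, here is a minimal sketch of splitting one long recording into fixed-length windows. It only computes window boundaries in seconds. The function name and the fixed-window strategy are assumptions made for illustration, not the authors' pipeline, which may segment recordings differently.

```python
def segment_bounds(total_seconds: float, window_seconds: float) -> list[tuple[float, float]]:
    """Split [0, total_seconds) into consecutive windows of at most window_seconds."""
    if total_seconds < 0 or window_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and window_seconds must be > 0")
    bounds = []
    start = 0.0
    while start < total_seconds:
        # The final window is shortened so it never runs past the recording.
        end = min(start + window_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


one_hour = 3600.0
short_windows = segment_bounds(one_hour, 30.0)        # conventional short utterances
long_windows = segment_bounds(one_hour, 21.8 * 60.0)  # 21.8-minute context windows
print(len(short_windows), len(long_windows))
```

Under these assumptions, a one-hour recording becomes 120 independent 30-second segments, while 21.8-minute windows cover the same recording in 3 segments, so each prediction can draw on far more surrounding context.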
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
