Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation
Nithin Rao Koluguri, Travis Bartley, Hainan Xu, Oleksii Hrinchuk, Jagadeesh Balam, Boris Ginsburg, Georg Kucsko

TL;DR
This paper trains speech recognition and translation models on longer utterances with full punctuation and capitalization, using the FastConformer architecture. The approach yields a 25% relative WER improvement on Earnings benchmarks and better handling of complete sentences, although accuracy plateaus for training sequences longer than 40 seconds.
Contribution
The paper proposes training on longer utterances with full punctuation and capitalization instead of the traditional short segments with lowercase or partially punctuated text. FastConformer makes this feasible for 1-billion-parameter models with full attention on sequences up to 60 seconds long.
Findings
25% relative WER improvement on Earnings benchmarks
Enhanced punctuation and capitalization accuracy
Increased overall speech recognition and translation accuracy
Abstract
This paper presents a new method for training sequence-to-sequence models for speech recognition and translation tasks. Instead of the traditional approach of training models on short segments containing only lowercase or partial punctuation and capitalization (PnC) sentences, we propose training on longer utterances that include complete sentences with proper punctuation and capitalization. We achieve this by using the FastConformer architecture which allows training 1 Billion parameter models with sequences up to 60 seconds long with full attention. However, while training with PnC enhances the overall performance, we observed that accuracy plateaus when training on sequences longer than 40 seconds across various evaluation settings. Our proposed method significantly improves punctuation and capitalization accuracy, showing a 25% relative word error rate (WER) improvement on the…
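To make the "relative WER improvement" figure concrete, here is a minimal Python sketch of how word error rate and a relative improvement are commonly computed. The function names and the example numbers are illustrative assumptions, not taken from the paper, and the paper's own scoring pipeline (including any text normalization) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Tokens are compared exactly, so differences in case or punctuation
    count as errors (a PnC-sensitive WER). This is an assumption made
    for illustration only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the arithmetic.
    print(relative_improvement(0.20, 0.15))
    print(word_error_rate("Hello, world.", "hello world"))
```

Under this definition, a drop from a hypothetical 20% WER to 15% WER is a 25% relative improvement, even though the absolute difference is only 5 points. The second call shows why punctuation and capitalization matter when scoring is case- and punctuation-sensitive: both words count as errors.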
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
