Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation

Nithin Rao Koluguri; Travis Bartley; Hainan Xu; Oleksii Hrinchuk; Jagadeesh Balam; Boris Ginsburg; Georg Kucsko

arXiv:2409.05601·eess.AS·September 10, 2024

TL;DR

This paper trains speech recognition and translation models on longer utterances with full punctuation and capitalization, using the FastConformer architecture. It reports a 25% relative WER improvement on Earnings benchmarks and better punctuation and capitalization accuracy on complete sentences.

Contribution

The paper proposes training on longer utterances with full punctuation and capitalization, using FastConformer to handle sequences up to 60 seconds long, and reports better performance than traditional short-segment training.

Findings

1. 25% relative WER improvement on Earnings benchmarks
2. Enhanced punctuation and capitalization accuracy
3. Increased overall speech recognition and translation accuracy
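
For readers unfamiliar with the term, a relative WER improvement is the reduction in word error rate expressed as a fraction of the baseline WER. The sketch below shows the arithmetic; the function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not results from the paper):
# a baseline WER of 20.0 and a new WER of 15.0 give a 25% relative improvement.
improvement = relative_wer_improvement(20.0, 15.0)
print(f"{improvement:.0%}")  # prints "25%"
```

Note that a relative improvement of 25% is not the same as an absolute drop of 25 WER points; in the hypothetical example the absolute drop is 5 points.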

Abstract

This paper presents a new method for training sequence-to-sequence models for speech recognition and translation tasks. Instead of the traditional approach of training models on short segments containing only lowercase or partial punctuation and capitalization (PnC) sentences, we propose training on longer utterances that include complete sentences with proper punctuation and capitalization. We achieve this by using the FastConformer architecture, which allows training 1-billion-parameter models on sequences up to 60 seconds long with full attention. However, while training with PnC enhances overall performance, we observed that accuracy plateaus when training on sequences longer than 40 seconds across various evaluation settings. Our proposed method significantly improves punctuation and capitalization accuracy, showing a 25% relative word error rate (WER) improvement on the…
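
The abstract does not say how the longer training utterances are built, so the following is only a minimal sketch of one plausible approach: greedily concatenating consecutive time-stamped segments from the same recording until a duration cap is reached, while keeping the punctuated and capitalized transcripts intact. The `Segment` class, the `merge_segments` function, and the example transcripts are all hypothetical; the default cap of 40 seconds is an assumption motivated by the plateau the abstract reports, not a setting confirmed by the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcript with punctuation and capitalization preserved


def merge_segments(segments: List[Segment], max_duration: float = 40.0) -> List[Segment]:
    """Greedily join consecutive segments while the merged span fits in max_duration."""
    merged: List[Segment] = []
    current = None
    for seg in segments:
        if current is None:
            current = Segment(seg.start, seg.end, seg.text)
        elif seg.end - current.start <= max_duration:
            # Extend the current utterance; transcripts are joined with a space.
            current = Segment(current.start, seg.end, f"{current.text} {seg.text}")
        else:
            # Adding this segment would exceed the cap, so start a new utterance.
            merged.append(current)
            current = Segment(seg.start, seg.end, seg.text)
    if current is not None:
        merged.append(current)
    return merged


# Hypothetical input: four short segments from one recording.
short_segments = [
    Segment(0.0, 12.0, "Good morning, everyone."),
    Segment(12.0, 27.0, "Revenue grew this quarter."),
    Segment(27.0, 45.0, "Margins were flat."),
    Segment(45.0, 58.0, "Any questions?"),
]
for utt in merge_segments(short_segments):
    print(f"{utt.end - utt.start:.0f}s: {utt.text}")
```

With the 40-second cap, the four hypothetical segments collapse into two utterances of 27 and 31 seconds. A single segment longer than the cap is passed through unchanged in this sketch; a real pipeline would need to decide whether to drop or split it.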

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques