Fast offline Transformer-based end-to-end automatic speech recognition for real-world applications
Yoo Rhee Oh, Kiyoung Park, Jeon Gyu Park

TL;DR
This paper introduces a fast offline Transformer-based end-to-end speech recognition system designed to transcribe large speech databases quickly and accurately with limited resources.
Contribution
It proposes novel techniques to accelerate Transformer-based speech recognition, including multi-utterance batched decoding, end-of-speech detection with CTC, and speech segmentation, demonstrating significant speed improvements.
Findings
Recognizes 8 hours of speech in under 3 minutes.
Achieves a 10.73% character error rate on real-world data.
Reduces recognition time by 27.1% compared to conventional systems.
Abstract
With the recent advances in technology, automatic speech recognition (ASR) has been widely used in real-world applications. The efficiency of converting large amounts of speech into text accurately with limited resources has become more important than ever. This paper proposes a method to rapidly recognize a large speech database via a Transformer-based end-to-end model. Transformers have improved the state-of-the-art performance in many fields. However, they are not easy to use for long sequences. In this paper, various techniques to speed up the recognition of real-world speech are proposed and tested, including decoding via multiple-utterance batched beam search, detecting end-of-speech based on connectionist temporal classification (CTC), restricting the CTC prefix score, and splitting long speech recordings into short segments. Experiments are conducted with the Librispeech English and…
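Two of the ideas named in the abstract, splitting long speech into short segments and decoding several utterances in one batch, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_into_segments` and `make_batches` are hypothetical, the fixed-length split is an assumption (the abstract does not say how split points are chosen), and sorting by length before batching is one common way to limit padding rather than something the abstract describes.

```python
from typing import List, Tuple


def split_into_segments(num_frames: int, max_len: int) -> List[Tuple[int, int]]:
    """Split [0, num_frames) into consecutive (start, end) spans.

    Assumption: a plain fixed-length split with at most max_len frames
    per span; a real system may choose split points differently.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [
        (start, min(start + max_len, num_frames))
        for start in range(0, num_frames, max_len)
    ]


def make_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Group utterance indices into batches of similar length.

    Sorting longest-first keeps utterances of similar length together,
    so each batch needs little padding when decoded jointly.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Usage: a 10-frame recording split into spans of at most 4 frames,
# then four utterances grouped two per batch.
segments = split_into_segments(10, 4)
batches = make_batches([5, 1, 9, 3], 2)
```

Each batch of indices would then be padded to a common length and passed to a batched beam search; that decoding step is outside the scope of this sketch.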
