Investigating End-to-End ASR Architectures for Long Form Audio Transcription
Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, Boris Ginsburg

TL;DR
This paper evaluates several end-to-end ASR architectures on long-form audio. The self-attention model with local attention and a global token is the most accurate, and CTC decoders prove more robust and efficient than RNNT decoders.
Contribution
It provides a comparative analysis of three families of ASR models on long audio (convolutional, convolutional with squeeze-and-excitation, and convolutional with attention) and compares CTC and RNNT decoders on the same benchmarks.
Findings
Self-attention models achieve the lowest Word Error Rate.
CTC-based models are more robust and efficient than RNNT for long audio.
Self-attention with local attention and a global token outperforms the other architectures.
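To make "local attention with a global token" concrete, here is a minimal sketch of an attention mask of that general shape. It is not taken from the paper: the function name, the placement of the global token at index 0, and the symmetric window are all assumptions made for illustration.

```python
from typing import List


def local_global_mask(num_frames: int, window: int) -> List[List[bool]]:
    """Build a boolean attention mask (hypothetical helper, not from the paper).

    Index 0 is an assumed global token; indices 1..num_frames are audio frames.
    mask[i][j] is True when position i is allowed to attend to position j.
    """
    if num_frames < 1 or window < 0:
        raise ValueError("num_frames must be >= 1 and window must be >= 0")

    size = num_frames + 1  # one extra slot for the global token
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == 0 or j == 0:
                # The global token sees every position and is seen by all.
                mask[i][j] = True
            elif abs(i - j) <= window:
                # Ordinary frames only see neighbours within the window.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    for row in local_global_mask(num_frames=6, window=1):
        print("".join("x" if allowed else "." for allowed in row))
```

With a mask like this, each frame attends to at most 2 * window + 2 positions regardless of audio length, which is why this style of attention scales to long recordings better than full self-attention.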
Abstract
This paper presents an overview and evaluation of some end-to-end ASR models on long-form audio. We study three categories of Automatic Speech Recognition (ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation and (3) convolutional models with attention. We selected one ASR model from each category and evaluated Word Error Rate, maximum audio length and real-time factor for each model on a variety of long audio benchmarks: Earnings-21 and 22, CORAAL, and TED-LIUM3. The model from the category of self-attention with local attention and global token has the best accuracy compared to other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT on long-form audio.
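Two of the metrics named in the abstract can be sketched in a few lines. The code below is an illustration under stated assumptions, not the paper's evaluation code: Word Error Rate is computed as word-level edit distance divided by the number of reference words, with no text normalization, and real-time factor is assumed to follow the common convention of processing time divided by audio duration. The function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Assumed convention: processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(real_time_factor(processing_seconds=30.0, audio_seconds=600.0))
```

In the example call, the hypothesis has one substitution and one deletion against a six-word reference, so the error rate is 2/6. Published WER figures usually apply text normalization (casing, punctuation) before scoring, which this sketch omits.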
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
