Investigating End-to-End ASR Architectures for Long Form Audio Transcription

Nithin Rao Koluguri; Samuel Kriman; Georgy Zelenfroind; Somshubra Majumdar; Dima Rekesh; Vahid Noroozi; Jagadeesh Balam; Boris Ginsburg

arXiv:2309.09950 · eess.AS · September 22, 2023

Open Access

TL;DR

This paper evaluates end-to-end ASR architectures on long-form audio. The model that uses self-attention with local attention and a global token is the most accurate, and CTC-based models are more robust and efficient than RNNT-based ones.

Contribution

It provides a comparative analysis of three categories of ASR models on long audio (convolutional, convolutional with squeeze-and-excitation, and convolutional with attention), measuring Word Error Rate, maximum audio length, and real-time factor, and it compares CTC and RNNT decoders.

Findings

01. The attention-based model achieves the lowest Word Error Rate of the three architecture categories evaluated.

02. CTC-based models are more robust and efficient than RNNT-based models on long-form audio.

03. The most accurate model combines local self-attention with a global token.
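
To make "local attention with a global token" concrete, here is a minimal sketch of one way such an attention pattern can be expressed as a boolean mask. This page does not describe the exact pattern the authors use, so everything below is an assumption for illustration: the function name `local_global_attention_mask`, the symmetric window, and the choice to place the global token(s) at the start of the sequence are all hypothetical.

```python
from typing import List


def local_global_attention_mask(
    seq_len: int, window: int, num_global: int = 1
) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumptions (not taken from the paper):
      * each position attends to neighbours within `window` steps on either side;
      * the first `num_global` positions are global tokens that attend to, and
        are attended by, every position.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            is_local = abs(i - j) <= window
            involves_global = i < num_global or j < num_global
            mask[i][j] = is_local or involves_global
    return mask


if __name__ == "__main__":
    # Print a small mask: 'x' marks an allowed query/key pair.
    for row in local_global_attention_mask(seq_len=6, window=1):
        print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window, the number of allowed pairs grows roughly linearly with sequence length rather than quadratically, which is the usual motivation for this kind of pattern on long inputs.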

Abstract

This paper presents an overview and evaluation of several end-to-end Automatic Speech Recognition (ASR) models on long-form audio. We study three categories of ASR models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation, and (3) convolutional models with attention. We selected one ASR model from each category and evaluated Word Error Rate, maximum audio length, and real-time factor for each model on a variety of long-audio benchmarks: Earnings-21 and Earnings-22, CORAAL, and TED-LIUM3. The model from the category of self-attention with local attention and a global token has the best accuracy compared to the other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT-based models on long-form audio.
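
Two of the metrics named in the abstract, Word Error Rate and real-time factor, can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' evaluation script: it assumes whitespace tokenization with no text normalization, and it assumes the common definition of real-time factor as processing time divided by audio duration (some toolkits report the inverse). The function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(real_time_factor(processing_seconds=30.0, audio_seconds=3600.0))
```

Under this definition, a real-time factor below 1 means the model transcribes faster than the audio plays.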

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing