Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition

Somshubra Majumdar; Jagadeesh Balam; Oleksii Hrinchuk; Vitaly Lavrukhin; Vahid Noroozi; Boris Ginsburg

arXiv:2104.01721·eess.AS·April 6, 2021·43 cites

TL;DR

Citrinet is a deep convolutional CTC-based speech recognition model that narrows the performance gap between non-autoregressive and autoregressive models, achieving near state-of-the-art accuracy on multiple datasets.

Contribution

The paper introduces Citrinet, a deep residual convolutional CTC model that combines 1D time-channel separable convolutions, sub-word encoding, and squeeze-and-excitation to improve the accuracy of non-autoregressive ASR.
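As a rough illustration of why separable convolutions suit a deep model, the sketch below compares weight counts for a regular 1D convolution and a time-channel separable one (a per-channel convolution over time followed by a pointwise 1x1 convolution across channels). The channel count and kernel width are hypothetical values chosen for the arithmetic, not settings taken from the paper, and biases are ignored.

```python
def standard_conv1d_params(c_in, c_out, kernel):
    """Weights in a regular 1D convolution (bias ignored)."""
    return c_in * c_out * kernel


def separable_conv1d_params(c_in, c_out, kernel):
    """Weights in a time-channel separable 1D convolution (bias ignored).

    Depthwise part: one kernel per input channel, applied along time.
    Pointwise part: a 1x1 convolution that mixes channels.
    """
    depthwise = c_in * kernel
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer size, assumed only for this comparison.
channels, kernel = 256, 11

standard = standard_conv1d_params(channels, channels, kernel)
separable = separable_conv1d_params(channels, channels, kernel)

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"reduction: {standard / separable:.1f}x")
```

With these assumed sizes the separable layer needs roughly a tenth of the weights, and the saving grows with the kernel width.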

Findings

01 Achieves accuracy close to the best autoregressive Transducer models on LibriSpeech and other datasets.

02 Reduces the gap between non-autoregressive CTC models and autoregressive sequence-to-sequence and transducer models.

03 Is evaluated on LibriSpeech, TED-LIUM2, AISHELL-1, and the English portion of Multilingual LibriSpeech (MLS).

Abstract

We propose Citrinet, a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is a deep residual neural model which uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive models and sequence-to-sequence and transducer models. We evaluate Citrinet on the LibriSpeech, TED-LIUM2, AISHELL-1 and Multilingual LibriSpeech (MLS) English speech datasets. Citrinet's accuracy on these datasets is close to that of the best autoregressive Transducer models.
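To make the non-autoregressive aspect of CTC concrete: each frame's label can be predicted independently, and the output sequence is obtained by collapsing repeated labels and dropping blanks. The following is a minimal sketch of standard greedy (best-path) CTC decoding, not the paper's own decoding code; the function name, the `blank_id=0` convention, and the example token IDs are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse a per-frame argmax label sequence into an output sequence.

    frame_ids: one predicted label ID per frame.
    blank_id:  ID of the CTC blank symbol (assumed to be 0 here).
    """
    # Step 1: merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(frame_ids)]
    # Step 2: remove blanks; a blank between two identical labels
    # is what allows a genuinely repeated token to survive step 1.
    return [label for label in collapsed if label != blank_id]


# Hypothetical per-frame predictions over sub-word token IDs.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_greedy_decode(frames))
```

No output token depends on a previously emitted token, which is the property that separates CTC models from autoregressive sequence-to-sequence and transducer decoders.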

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing