Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition
Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, Boris Ginsburg

TL;DR
Citrinet is a deep convolutional CTC-based speech recognition model that narrows the performance gap between non-autoregressive and autoregressive models, achieving near state-of-the-art accuracy on multiple datasets.
Contribution
The paper introduces Citrinet, a deep residual convolutional CTC model that combines 1D time-channel separable convolutions, sub-word encoding, and squeeze-and-excitation to improve non-autoregressive ASR accuracy.
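To give a sense of why separable convolutions suit a deep convolutional model, the sketch below compares the weight count of a regular 1D convolution with a depthwise-plus-pointwise factorization. The function names and the layer sizes (256 channels, kernel width 11) are assumptions chosen for illustration, not values taken from the paper.

```python
def standard_conv1d_params(channels_in, channels_out, kernel_size):
    """Weights in a regular 1D convolution (bias omitted)."""
    return channels_in * channels_out * kernel_size


def separable_conv1d_params(channels_in, channels_out, kernel_size):
    """Weights in a depthwise conv (one filter per channel, along time)
    followed by a pointwise 1x1 conv (mixing channels), bias omitted."""
    depthwise = channels_in * kernel_size
    pointwise = channels_in * channels_out
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only.
CHANNELS, KERNEL = 256, 11
standard = standard_conv1d_params(CHANNELS, CHANNELS, KERNEL)
separable = separable_conv1d_params(CHANNELS, CHANNELS, KERNEL)
print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"ratio:     {standard / separable:.1f}x")
```

Under these assumed sizes the factorized layer needs roughly a tenth of the weights, which is what makes wide kernels and many stacked blocks affordable.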
Findings
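Achieves accuracy close to the best autoregressive Transducer models on LibriSpeech, TED-LIUM2, AISHELL-1, and Multilingual LibriSpeech (MLS) English.
Significantly reduces the gap between non-autoregressive CTC models and autoregressive sequence-to-sequence and transducer models.
Holds up across English and Mandarin (AISHELL-1) benchmarks rather than on a single dataset.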
Abstract
We propose Citrinet - a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is a deep residual neural model which uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive models and sequence-to-sequence and transducer models. We evaluate Citrinet on the LibriSpeech, TED-LIUM2, AISHELL-1 and Multilingual LibriSpeech (MLS) English speech datasets. Citrinet accuracy on these datasets is close to that of the best autoregressive Transducer models.
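As a rough illustration of what "non-autoregressive" means for a CTC model with sub-word outputs, the sketch below implements standard greedy CTC decoding: each frame's best token is picked independently, repeated tokens are collapsed, and blanks are dropped. The five-entry vocabulary, the `▁` word-start marker, and the helper `scores_for` are hypothetical and exist only to make the example self-contained; this is not the paper's decoding code.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
    # Each frame is decided on its own; no frame depends on earlier outputs.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    tokens = []
    previous = None
    for token_id in best_path:
        if token_id != previous and token_id != blank_id:
            tokens.append(vocab[token_id])
        previous = token_id
    # Assumed convention: "▁" marks the start of a word in the sub-word vocabulary.
    return "".join(tokens).replace("▁", " ").strip()


# Hypothetical sub-word vocabulary with the blank symbol in the last slot.
VOCAB = ["▁he", "llo", "▁wor", "ld", "<blank>"]
BLANK_ID = 4


def scores_for(path, vocab_size):
    """Build one-hot frame scores from a token-id path (illustration helper)."""
    return [
        [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
        for token_id in path
    ]


frame_path = [0, 0, 4, 1, 1, 4, 2, 3, 3]
text = ctc_greedy_decode(scores_for(frame_path, len(VOCAB)), VOCAB, BLANK_ID)
print(text)
```

A blank between two identical tokens keeps both of them, whereas back-to-back repeats without a blank collapse into one; that is how CTC represents genuinely repeated sub-words.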
Code & Models
- 🤗 nvidia/stt_en_citrinet_1024_gamma_0_25 · model · 26 dl · ♡ 3
- 🤗 nvidia/stt_zh_citrinet_1024_gamma_0_25 · model · 251 dl · ♡ 5
- 🤗 nvidia/stt_en_citrinet_256_ls · model · 80 dl · ♡ 1
- 🤗 nvidia/stt_en_citrinet_384_ls · model · 13 dl
- 🤗 nvidia/stt_en_citrinet_512_ls · model · 13 dl
- 🤗 nvidia/stt_en_citrinet_768_ls · model · 5 dl
- 🤗 nvidia/stt_en_citrinet_1024_ls · model
- 🤗 nvidia/stt_uk_citrinet_1024_gamma_0_25 · model · 276 dl · ♡ 13
- 🤗 Aditya02/stt_en_citrinet_1024 · model · 2 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
