Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang

TL;DR
The paper introduces Conformer, a hybrid model combining convolutional neural networks and transformers, achieving state-of-the-art speech recognition accuracy with parameter efficiency on LibriSpeech.
Contribution
It proposes the Conformer architecture that effectively integrates CNNs and transformers for improved local and global dependency modeling in speech recognition.
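To make the local/global distinction concrete, here is a toy sketch in plain Python. It is not the paper's block: it assumes identity projections in place of learned ones, a single attention head, one temporal kernel shared across channels, and it omits the feed-forward modules, normalization, and dropout. The function names (`self_attention`, `temporal_conv`, `toy_block`) are hypothetical and exist only for this illustration.

```python
import math

def self_attention(x):
    """Single-head dot-product self-attention with identity projections (toy).
    Every output frame is a weighted average of ALL frames: global mixing."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out

def temporal_conv(x, kernel):
    """Per-channel 1-D convolution over time with zero padding (toy).
    Every output frame only sees len(kernel) neighbouring frames: local mixing.
    Assumption: the same kernel is reused for every channel, for brevity."""
    t, d = len(x), len(x[0])
    half = len(kernel) // 2
    out = []
    for i in range(t):
        row = []
        for c in range(d):
            acc = 0.0
            for j, w in enumerate(kernel):
                idx = i + j - half
                if 0 <= idx < t:  # frames outside the sequence count as zero
                    acc += w * x[idx][c]
            row.append(acc)
        out.append(row)
    return out

def add(x, y):
    """Element-wise sum of two (time, channel) sequences: a residual connection."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]

def toy_block(x, kernel):
    x = add(x, self_attention(x))          # global, content-based interactions
    x = add(x, temporal_conv(x, kernel))   # local features from nearby frames
    return x

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 frames, 2 channels
mixed = toy_block(frames, [0.25, 0.5, 0.25])
```

The output keeps the input shape (3 frames by 2 channels), so blocks of this form can be stacked. The attention step lets a frame draw on any other frame in the utterance, while the convolution step restricts each frame to its immediate neighbours.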
Findings
Conformer achieves 2.1%/4.3% WER on LibriSpeech test/test-other without a language model.
Conformer outperforms previous Transformer- and CNN-based models with fewer parameters.
A small Conformer model (10M parameters) achieves a competitive 2.7%/6.3% WER.
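For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below is a minimal, generic implementation and not the paper's evaluation code; it assumes whitespace tokenization and no text normalization, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(references, hypotheses):
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        total_errors += edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    return total_errors / total_words

# Made-up example: one deleted word and one substituted word over 8 reference words.
refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello word"]
score = wer(refs, hyps)  # 2 errors / 8 words = 0.25
```

A score of 0.25 corresponds to 25% WER, so a reported 2.1% means roughly two word errors per hundred reference words.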
Abstract
Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. To this regard, we propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external…
Code & Models
- 🤗 google/medasr · model · 15k dl · ♡ 297
- 🤗 nvidia/parakeet-ctc-0.6b-Vietnamese · model · 681 dl · ♡ 82
- 🤗 nvidia/stt_en_conformer_ctc_large · model · 1.3k dl · ♡ 30
- 🤗 nvidia/stt_en_fastconformer_hybrid_large_streaming_multi · model · 929 dl · ♡ 24
- 🤗 eesungkim/stt_kr_conformer_transducer_large · model · 51 dl · ♡ 10
- 🤗 nvidia/stt_en_conformer_transducer_xlarge · model · 63 dl · ♡ 56
- 🤗 nvidia/stt_de_conformer_ctc_large · model · 140 dl · ♡ 5
- 🤗 nvidia/stt_de_conformer_transducer_large · model · 14 dl · ♡ 7
- 🤗 nvidia/stt_fr_conformer_ctc_large · model · 106 dl · ♡ 7
- 🤗 nvidia/stt_zh_conformer_transducer_large · model · 649 dl · ♡ 13
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Convolution · Adam · Dropout
