Conformer: Convolution-augmented Transformer for Speech Recognition

Anmol Gulati; James Qin; Chung-Cheng Chiu; Niki Parmar; Yu Zhang; Jiahui Yu; Wei Han; Shibo Wang; Zhengdong Zhang; Yonghui Wu; Ruoming Pang

arXiv:2005.08100 · eess.AS · May 19, 2020 · 381 citations

TL;DR

The paper introduces Conformer, a hybrid model that combines convolutional neural networks and Transformers, and reports state-of-the-art, parameter-efficient speech recognition accuracy on LibriSpeech.

Contribution

The paper proposes the Conformer architecture, which integrates CNNs and Transformers so that a single model captures both the local and the global dependencies of an audio sequence.

Findings

01

Conformer achieves a word error rate (WER) of 2.1%/4.3% on the LibriSpeech test/test-other sets without a language model.

02

Conformer outperforms previous Transformer- and CNN-based models while using fewer parameters.

03

A small Conformer model with only 10M parameters still achieves a competitive WER of 2.7%/6.3%.
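
The findings above are reported as word error rate (WER). As a rough illustration of what that metric measures, here is a minimal sketch that computes WER for one utterance as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical, and this is not the scoring tool used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper only (hypothetical name); not the paper's scorer.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
    print(f"WER = {word_error_rate(ref, hyp):.1%}")  # prints "WER = 33.3%"
```

Benchmark figures are usually aggregated over a whole test set (total errors divided by total reference words) rather than averaged per utterance, so this per-utterance sketch would need to accumulate counts across utterances to be comparable.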

Abstract

Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. To this regard, we propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Convolution · Adam · Dropout