ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

Wei Han; Zhengdong Zhang; Yu Zhang; Jiahui Yu; Chung-Cheng Chiu; James Qin; Anmol Gulati; Ruoming Pang; Yonghui Wu

arXiv:2005.03191·eess.AS·May 19, 2020·72 cites

TL;DR

ContextNet introduces a novel CNN-RNN-transducer architecture with global context integration, achieving state-of-the-art speech recognition accuracy on LibriSpeech with fewer parameters and without external language models.

Contribution

The paper proposes ContextNet, a CNN-RNN-transducer with global context modules and a scaling method, advancing CNN performance in end-to-end speech recognition.

Findings

01. Achieves 2.1%/4.6% WER on the clean/noisy LibriSpeech test sets without an external LM.

02. Outperforms previous CNN-based systems in accuracy and parameter efficiency.

03. Validated on a large internal dataset, with superior results reported there as well.

Abstract

Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet and achieves a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best…
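
To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of a squeeze-and-excitation step applied to a 1D feature map, plus a width-scaling helper. It follows the generic squeeze-and-excitation recipe (average over time, a small bottleneck, a sigmoid gate per channel) and is not taken from the authors' implementation. The function names, the ReLU in the bottleneck, the reduction ratio, the rounding rule in `scale_width`, and all channel counts are assumptions chosen for illustration.

```python
import numpy as np


def squeeze_excite_1d(x, w1, b1, w2, b2):
    """Gate each channel of x (shape: channels x time) with a global summary.

    Hypothetical helper: the bottleneck nonlinearity (ReLU) is an assumption.
    """
    # Squeeze: average over the whole time axis, giving one value per channel.
    context = x.mean(axis=1)                              # (C,)
    # Excite: bottleneck projection, then expand back to C channels.
    hidden = np.maximum(0.0, w1 @ context + b1)           # (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))      # (C,), values in (0, 1)
    # Rescale every time step of a channel by that channel's gate.
    return x * gate[:, None]


def scale_width(base_channels, alpha):
    """Scale per-layer channel counts by alpha (rounding rule is an assumption)."""
    return [max(1, int(round(c * alpha))) for c in base_channels]


# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
channels, time_steps, reduction = 8, 50, 4
x = rng.standard_normal((channels, time_steps))
w1 = rng.standard_normal((channels // reduction, channels)) * 0.1
b1 = np.zeros(channels // reduction)
w2 = rng.standard_normal((channels, channels // reduction)) * 0.1
b2 = np.zeros(channels)

y = squeeze_excite_1d(x, w1, b1, w2, b2)
narrow = scale_width([256, 512], 0.5)
```

Because the gate is computed from an average over the full sequence, every output frame depends on the entire utterance, which is what a convolution with a finite kernel cannot provide on its own. The `alpha` multiplier changes only channel counts, so depth and kernel sizes stay fixed while parameters and computation shrink or grow.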

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Convolution