Residual Convolutional CTC Networks for Automatic Speech Recognition

Yisen Wang; Xuejiao Deng; Songbai Pu; Zhiheng Huang

arXiv:1702.07793 · cs.CL · February 28, 2017 · 66 cites

TL;DR

This paper introduces RCNN-CTC, a deep residual CNN trained with CTC loss for end-to-end automatic speech recognition, and reports lower word error rates (WER) on the WSJ and Tencent Chat datasets.

Contribution

The paper proposes a novel deep residual CNN architecture with CTC loss for end-to-end speech recognition, and introduces a CTC-based system combination method.

Findings

01 RCNN-CTC achieved the lowest word error rate (WER) among the compared systems on the WSJ and Tencent Chat datasets.

02 CTC-based system combination further reduced error rates.

03 Demonstrated the effectiveness of deep residual CNNs in ASR.

Abstract

Deep learning approaches have been widely used in Automatic Speech Recognition (ASR) and they have achieved a significant accuracy improvement. Especially, Convolutional Neural Networks (CNNs) have been revisited in ASR recently. However, most CNNs used in existing work have less than 10 layers which may not be deep enough to capture all human speech signal information. In this paper, we propose a novel deep and wide CNN architecture denoted as RCNN-CTC, which has residual connections and Connectionist Temporal Classification (CTC) loss function. RCNN-CTC is an end-to-end system which can exploit temporal and spectral structures of speech signals simultaneously. Furthermore, we introduce a CTC-based system combination, which is different from the conventional frame-wise senone-based one. The basic subsystems adopted in the combination are different types and thus mutually complementary…
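The abstract names the CTC loss without showing what it computes. As a rough illustration only, the sketch below implements the standard CTC forward recursion in plain Python: it sums the probability of every frame-level alignment that collapses to a target label sequence. It is not the authors' implementation; the function names `ctc_sequence_probability` and `ctc_loss`, the choice of index 0 for the blank symbol, and the list-of-lists input format are all assumptions made for this example.

```python
import math


def ctc_sequence_probability(frame_probs, labels, blank=0):
    """Probability that per-frame outputs collapse to `labels` under CTC.

    frame_probs: list of T lists; frame_probs[t][k] = P(symbol k at frame t).
    labels: target symbol indices, without blanks (hypothetical input format).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    extended = [blank]
    for symbol in labels:
        extended += [symbol, blank]
    size = len(extended)

    # alpha[s] = total probability of partial alignments ending at extended[s].
    alpha = [0.0] * size
    alpha[0] = frame_probs[0][extended[0]]
    if size > 1:
        alpha[1] = frame_probs[0][extended[1]]

    for t in range(1, len(frame_probs)):
        prev = alpha
        alpha = [0.0] * size
        for s in range(size):
            total = prev[s]                  # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]         # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * frame_probs[t][extended[s]]

    # Valid alignments end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(frame_probs, labels, blank=0):
    """Negative log-likelihood of `labels`; assumes the probability is nonzero."""
    return -math.log(ctc_sequence_probability(frame_probs, labels, blank))
```

For two frames with uniform probabilities over {blank, a}, the target "a" is reached by three alignments ("a blank", "blank a", "a a"), each with probability 0.25, so `ctc_sequence_probability([[0.5, 0.5], [0.5, 0.5]], [1])` returns 0.75. A real system would run this recursion in log space over the network's softmax outputs to avoid underflow on long utterances.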

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing