Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

Takaaki Hori; Shinji Watanabe; Yu Zhang; William Chan

arXiv:1706.02737 · cs.CL · June 12, 2017 · 21 cites

TL;DR

This paper introduces an end-to-end speech recognition model that pairs a deep CNN encoder with jointly trained CTC and attention-based decoding, adds a separately trained LSTM language model during beam search, and reports a 5-10% error reduction over prior systems on spontaneous Japanese and Chinese speech.

Contribution

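The paper presents a joint CTC-attention model with a deep VGG-based CNN encoder and a separately trained LSTM language model that is combined with the network's predictions during beam search; the resulting end-to-end system outperforms traditional hybrid ASR systems.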

Findings

01 Achieved 5-10% error reduction on Japanese and Chinese speech datasets.

02 Outperformed traditional hybrid ASR systems.

03 Demonstrated effectiveness of joint CTC-attention training with CNN encoder.

Abstract

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
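The abstract describes two combination steps: joint training of the CTC and attention branches, and a beam search that merges CTC, attention-decoder, and language-model predictions. The abstract does not state the combination rule, so the sketch below assumes a simple weighted interpolation of losses and of log-probabilities, which is a common choice for this kind of model. The function names and the weight values (`ctc_weight`, `lm_weight`) are hypothetical and chosen only for illustration.

```python
def joint_loss(ctc_loss, att_loss, ctc_weight=0.5):
    """Interpolate the CTC and attention training losses (assumed form)."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(logp_ctc, logp_att, logp_lm, ctc_weight=0.3, lm_weight=0.1):
    """Combine the three log-probabilities of one hypothesis (assumed form)."""
    return (ctc_weight * logp_ctc
            + (1.0 - ctc_weight) * logp_att
            + lm_weight * logp_lm)


def rescore(hypotheses, ctc_weight=0.3, lm_weight=0.1):
    """Sort (text, logp_ctc, logp_att, logp_lm) tuples best-first."""
    return sorted(
        hypotheses,
        key=lambda h: joint_score(h[1], h[2], h[3], ctc_weight, lm_weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy hypotheses with made-up log-probabilities, not real model output.
    hyps = [("hypothesis a", -2.0, -1.0, -3.0),
            ("hypothesis b", -1.0, -2.0, -1.0)]
    for text, *_ in rescore(hyps):
        print(text)
```

Raising `lm_weight` shifts the ranking toward hypotheses the language model prefers, while `ctc_weight` trades off the CTC branch against the attention decoder. In a real decoder these scores would be applied to partial hypotheses at every beam search step rather than to finished ones.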

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Dropout · Dense Connections · Max Pooling · Softmax · Convolution