Cascaded CNN-resBiLSTM-CTC: An End-to-End Acoustic Model For Speech Recognition

Xinpei Zhou; Jiwei Li; Xi Zhou

arXiv:1810.12001 · eess.AS · October 31, 2018 · 1 citation

Open Access

TL;DR

This paper introduces an end-to-end speech recognition model that combines a CNN, residual BiLSTM layers, and CTC. It reports 3.41% WER on LibriSpeech test-clean and 25% less training time with a batch-varied training method.

Contribution

The paper presents a new cascaded CNN-resBiLSTM-CTC architecture that adds residual blocks to the BiLSTM layers, together with a batch-varied training method that shortens training on length-varied inputs.
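
For readers unfamiliar with the CTC part of the architecture, the following is a minimal sketch of CTC best-path (greedy) decoding: take the top label per frame, merge consecutive repeats, then drop blanks. The toy vocabulary, the blank index of 0, and the function name `ctc_greedy_decode` are illustrative assumptions. The paper itself reports n-gram LM rescoring, which this sketch does not cover.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label index for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # groupby merges runs of identical labels; blanks are then removed.
    return [label for label, _ in groupby(best_path) if label != blank]


# Hypothetical toy vocabulary: 0 = blank, 1 = "a", 2 = "b"
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates the two "a" labels)
    [0.20, 0.70, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
]
decoded = ctc_greedy_decode(frames)
print(decoded)  # [1, 1, 2]
```

The blank between the second and fourth frames is what lets the output contain the same label twice in a row.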

Findings

01

Achieved 3.41% WER on the LibriSpeech test-clean set, using a simple FFT technique and n-gram language model rescoring.

02

Reduced training time by 25% with the new batch-varied method.

03

Added residual blocks to the BiLSTM layers to extract phoneme and semantic information together.
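
The WER figure above is a word-level edit distance divided by the number of reference words. The sketch below shows the standard computation on a made-up sentence pair. The function name `word_error_rate` and the example strings are illustrative and not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")  # 33.33%
```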

Abstract

Automatic speech recognition (ASR) tasks can be solved by end-to-end deep learning models, which require less preparation of raw data and transfer more easily between languages. We propose a novel end-to-end deep learning architecture, the cascaded CNN-resBiLSTM-CTC. In the proposed model, we add residual blocks to the BiLSTM layers to extract sophisticated phoneme and semantic information together, and apply a cascaded structure to focus on mining information from hard negative samples. Using a simple Fast Fourier Transform (FFT) technique and n-gram language model (LM) rescoring, we achieve a word error rate (WER) of 3.41% on the LibriSpeech test-clean corpus. Furthermore, we propose a new batch-varied method to speed up training in length-varied tasks, which results in 25% less training time.
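
The abstract does not spell out how the batch-varied method works. As a rough illustration of the general idea of letting batch size vary with sequence length, the sketch below sorts utterances by length and packs them under a fixed padded-frame budget, so short utterances form larger batches. This is one common approach and an assumption on our part, not the authors' algorithm. The function name `frame_budget_batches`, the budget value, and the example lengths are all hypothetical.

```python
def frame_budget_batches(lengths, max_frames_per_batch):
    """Group utterance indices so batch_size * longest_length stays within a budget."""
    # Sort indices by utterance length so similar lengths share a batch.
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    batches, current = [], []
    for idx in order:
        # Lengths are ascending, so lengths[idx] would be the longest item
        # in the candidate batch and sets its padded size.
        padded_frames = (len(current) + 1) * lengths[idx]
        if current and padded_frames > max_frames_per_batch:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


# Hypothetical utterance lengths in frames.
lengths = [120, 950, 130, 400, 980, 110, 420]
batches = frame_budget_batches(lengths, max_frames_per_batch=1000)
print(batches)  # [[5, 0, 2], [3, 6], [1], [4]]
```

Here the three shortest utterances share one batch while each long utterance gets its own, which keeps the amount of padding per batch bounded.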

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Bidirectional LSTM