End-To-End Speech Recognition Using A High Rank LSTM-CTC Based Model
Yangyang Shi, Mei-Yuh Hwang, Xin Lei

TL;DR
This paper replaces the bottleneck projection matrix in LSTM-CTC speech recognition models with a high rank projection layer, which improves model expressiveness and yields a 4-6% relative word error rate (WER) reduction on WSJ and LibriSpeech without external data or augmentation.
Contribution
The paper proposes a novel high rank projection layer to enhance LSTM-CTC models' expressiveness in end-to-end speech recognition.
Findings
Achieves 4-6% relative WER reduction on WSJ and LibriSpeech datasets.
Outperforms other published CTC-based end-to-end models without external data.
Code is publicly available for reproducibility.
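A relative WER reduction is measured against the baseline WER, not in absolute percentage points. The short sketch below shows the arithmetic; the function name and the example numbers are purely illustrative and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both inputs are WERs in percent. The result is the drop in WER
    expressed as a fraction of the baseline WER.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the formula:
# a drop from 10.0% to 9.5% WER is 0.5 points absolute, 5% relative.
example = relative_wer_reduction(10.0, 9.5)
```

Under this definition, a 4-6% relative reduction on a baseline with a WER of 10% would correspond to roughly 0.4 to 0.6 absolute points.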
Abstract
Long Short Term Memory Connectionist Temporal Classification (LSTM-CTC) based end-to-end models are widely used in speech recognition due to their simplicity in training and efficiency in decoding. In conventional LSTM-CTC based models, a bottleneck projection matrix maps the hidden feature vectors obtained from the LSTM to the softmax output layer. In this paper, we propose to use a high rank projection layer to replace the projection matrix. The output of the high rank projection layer is a weighted combination of vectors that are projected from the hidden feature vectors via different projection matrices and a non-linear activation function. The high rank projection layer improves the expressiveness of LSTM-CTC models. The experimental results show that on the Wall Street Journal (WSJ) corpus and the LibriSpeech data set, the proposed method achieves 4%-6% relative word error rate (WER)…
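The layer described in the abstract can be sketched roughly as follows in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the combination weights are computed, so the softmax gate (`gate_mat`) is an assumption, as are the choice of `tanh` as the non-linearity, all dimensions, and every name in the code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def high_rank_projection(h, proj_mats, gate_mat):
    """Weighted combination of K non-linear projections of h.

    h:         (T, H) hidden feature vectors from the LSTM, one per frame
    proj_mats: (K, H, D) K different projection matrices
    gate_mat:  (H, K) produces the K mixing weights per frame
               (ASSUMPTION: the abstract does not specify the weighting)
    returns:   (T, D) projected features fed to the softmax output layer
    """
    # Project h with each of the K matrices, then apply the non-linearity
    # (ASSUMPTION: tanh; the abstract only says "non-linear activation").
    projected = np.tanh(np.einsum("th,khd->tkd", h, proj_mats))  # (T, K, D)
    # Per-frame mixing weights that sum to 1 over the K components.
    weights = softmax(h @ gate_mat, axis=-1)  # (T, K)
    # Weighted combination of the K projected vectors.
    return np.einsum("tk,tkd->td", weights, projected)  # (T, D)


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
T, H, D, K, V = 5, 8, 4, 3, 10  # frames, hidden, bottleneck, components, labels
h = rng.standard_normal((T, H))
proj_mats = 0.1 * rng.standard_normal((K, H, D))
gate_mat = 0.1 * rng.standard_normal((H, K))
out_mat = 0.1 * rng.standard_normal((D, V))

z = high_rank_projection(h, proj_mats, gate_mat)
posteriors = softmax(z @ out_mat, axis=-1)  # per-frame label posteriors for CTC
```

With `K = 1` the sketch reduces to a single projection followed by `tanh`. With `K > 1` the mixing weights depend on the input frame, so the mapping from hidden vectors to output-layer inputs is no longer one fixed low-rank linear map.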
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Softmax
