Advances in Very Deep Convolutional Neural Networks for LVCSR
Tom Sercu, Vaibhava Goel

TL;DR
This paper introduces a very deep CNN design for LVCSR without timepadding or timepooling, which enables efficient sequence training and deployment. With batch normalization recovering the accuracy lost by removing timepooling, a single model matches the previous best published result on large datasets.
Contribution
The paper proposes a CNN architecture without timepadding and timepooling, facilitating efficient sequence evaluation and batch normalization, leading to improved large-scale speech recognition performance.
Findings
Achieved 9.4% WER on Hub5 test set with a single model.
Enabled efficient sequence training and deployment of CNNs.
Matched previous best results with a simpler, more scalable model.
Abstract
Very deep CNNs with small 3x3 kernels have recently been shown to achieve very strong performance as acoustic models in hybrid NN-HMM speech recognition systems. In this paper we investigate how to efficiently scale these models to larger datasets. Specifically, we address the design choice of pooling and padding along the time dimension, which renders convolutional evaluation of sequences highly inefficient. We propose a new CNN design without timepadding and without timepooling, which is slightly suboptimal for accuracy but has two significant advantages: it enables sequence training and deployment by allowing efficient convolutional evaluation of full utterances, and it allows batch normalization to be straightforwardly adopted for CNNs on sequence data. Through batch normalization, we recover the lost performance from removing the time-pooling, while keeping the benefit of…
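To make the "efficient convolutional evaluation of full utterances" point concrete, here is a minimal sketch in NumPy. It is not the authors' model: it is an assumed toy stack of four one-dimensional layers with kernel size 3 along time, random weights, and a ReLU after each layer. The names conv1d_valid and stack are hypothetical helpers. Because no layer pads or pools in time, running the stack once over the whole utterance produces the same outputs as running it separately on every context window, while sharing all intermediate activations.

```python
import numpy as np

def conv1d_valid(x, w):
    """Unpadded ('valid') sliding dot product along time.

    Output has len(x) - len(w) + 1 frames, so each layer trims
    len(w) - 1 frames instead of padding the input.
    """
    k = len(w)
    return np.array([np.dot(x[t:t + k], w) for t in range(len(x) - k + 1)])

def stack(x, kernels):
    """Apply each unpadded layer followed by a ReLU (toy network, no pooling)."""
    for w in kernels:
        x = np.maximum(conv1d_valid(x, w), 0.0)
    return x

rng = np.random.default_rng(0)
kernels = [rng.standard_normal(3) for _ in range(4)]  # assumed: 4 layers, kernel size 3
context = 1 + sum(len(w) - 1 for w in kernels)        # receptive field in frames
utterance = rng.standard_normal(50)                   # assumed: 50-frame toy utterance

# Full-utterance evaluation: a single pass over all frames.
full = stack(utterance, kernels)

# Window-by-window evaluation: one forward pass per output frame.
windowed = np.array([
    stack(utterance[t:t + context], kernels)[0]
    for t in range(len(utterance) - context + 1)
])

print(context, full.shape, np.allclose(full, windowed))
```

With padding in time, each window would see zeros at its edges that the full utterance does not contain at those positions, so the two evaluations would no longer agree and the per-window passes could not be replaced by a single pass.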
