Improvements to deep convolutional neural networks for LVCSR
Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

TL;DR
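This paper improves deep CNNs for large vocabulary continuous speech recognition (LVCSR) by comparing limited and full weight sharing, applying pooling strategies from computer vision, incorporating fMLLR speaker adaptation into log-mel features, and using dropout during Hessian-free sequence training, yielding relative WER improvements over previous CNN baselines.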
Contribution
It introduces a method for incorporating fMLLR speaker adaptation into log-mel features and a strategy for using dropout during Hessian-free sequence training of CNNs for LVCSR.
Findings
Achieved a 2-3% relative WER reduction on the 50-hour Broadcast News (BN) task
Achieved a 4-5% relative WER reduction on the 400-hour Broadcast News (BN) task
Validated improvements over previous CNN baselines
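The findings above are stated as relative reductions, meaning the improvement is measured as a fraction of the baseline WER rather than in absolute percentage points. A minimal sketch of that arithmetic follows. The function name and the WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not taken from the paper):
# a baseline WER of 20.0% improved to 19.5% is a 0.5-point absolute gain.
example = relative_wer_reduction(20.0, 19.5)
print(f"relative reduction: {example:.1f}%")
```

Under these assumed numbers, a 0.5-point absolute drop from a 20.0% baseline corresponds to a relative reduction in the 2-3% range quoted for the 50-hour task.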
Abstract
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) of 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these…
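One commonly cited reason CNNs tolerate spectral variation is max pooling along the frequency axis: a peak that moves to a neighboring band inside the same pooling window leaves the pooled output unchanged. The sketch below illustrates only that general idea on toy values. It is not the paper's architecture or any of its specific pooling strategies, and the function name, pool size, and activation values are assumptions made for illustration.

```python
from typing import List


def max_pool_frequency(activations: List[float], pool_size: int) -> List[float]:
    """Non-overlapping max pooling over the frequency bands of a single frame."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return [
        max(activations[i:i + pool_size])
        for i in range(0, len(activations), pool_size)
    ]


# Toy activations over 6 frequency bands (hypothetical values).
original = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1]
# Same peaks, each moved to a neighboring band within its pooling window.
shifted = [0.9, 0.1, 0.2, 0.3, 0.1, 0.8]

pooled_original = max_pool_frequency(original, pool_size=3)
pooled_shifted = max_pool_frequency(shifted, pool_size=3)
print(pooled_original, pooled_shifted)
```

With a pool size of 3, both inputs pool to the same two values, which is the small-shift invariance the abstract alludes to when it says CNNs reduce spectral variation.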
