Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks
Dong Yu, Michael L. Seltzer, Jinyu Li, Jui-Ting Huang, Frank Seide

TL;DR
This paper demonstrates that deep neural networks improve speech recognition by learning robust internal representations that are increasingly invariant to input perturbations, outperforming traditional models when trained on representative data.
Contribution
It provides insights into how DNNs learn stable, discriminative features for speech recognition and highlights their robustness and limitations compared to shallow models.
Findings
Deeper networks produce more invariant internal features.
DNNs perform as well as or better than GMMs without explicit adaptation.
When the training data are sufficiently representative, internal features are relatively stable across speaker and environmental variations.
Abstract
Recent studies have shown that deep neural networks (DNNs) perform significantly better than shallow networks and Gaussian mixture models (GMMs) on large vocabulary speech recognition tasks. In this paper, we argue that the improved accuracy achieved by the DNNs is the result of their ability to extract discriminative internal representations that are robust to the many sources of variability in speech signals. We show that these representations become increasingly insensitive to small perturbations in the input with increasing network depth, which leads to better speech recognition performance with deeper networks. We also show that DNNs cannot extrapolate to test samples that are substantially different from the training examples. If the training data are sufficiently representative, however, internal features learned by the DNN are relatively stable with respect to speaker…
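One way to make the depth claim concrete is to measure how much a small input perturbation moves each hidden layer's activations. The sketch below is an illustration only, not the authors' experimental setup: the sigmoid network, its randomly drawn small weights, the layer sizes, and the helper names (layer_activations, perturbation_gains) are all assumptions. With sigmoid units the slope is at most 0.25, so when the weight magnitudes are small the perturbation shrinks from layer to layer. In this toy case that shrinkage comes from the chosen weight scale, not from training.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def layer_activations(x, weights, biases):
    """Return the hidden activations [h_1, ..., h_L] for input vector x."""
    activations = []
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
        activations.append(h)
    return activations


def perturbation_gains(x, delta, weights, biases):
    """Per-layer ratio ||h_l(x + delta) - h_l(x)|| / ||delta||."""
    clean = layer_activations(x, weights, biases)
    noisy = layer_activations(x + delta, weights, biases)
    base = np.linalg.norm(delta)
    return [float(np.linalg.norm(n - c) / base) for c, n in zip(clean, noisy)]


# Hypothetical toy network: 40-dim input, four hidden layers of 64 sigmoid units.
# Weights are random (assumption), standing in for a trained model.
rng = np.random.default_rng(0)
dims = [40, 64, 64, 64, 64]
weights = [rng.normal(0.0, 0.05, size=(m, n)) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

x = rng.normal(size=dims[0])             # stand-in for one feature frame
delta = 1e-3 * rng.normal(size=dims[0])  # small input perturbation

gains = perturbation_gains(x, delta, weights, biases)
for depth, g in enumerate(gains, start=1):
    print(f"layer {depth}: gain = {g:.3e}")
```

To probe a trained model in the same way, the random weights would be replaced by the learned ones and the gains averaged over many frames and perturbations.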
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
