Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers
Shucong Zhang, Cong-Thanh Do, Rama Doddipatla, Erfan Loweimi, Peter Bell, Steve Renals

TL;DR
This paper introduces a top-down training method for neural networks that builds on the observation that a frozen classifier (the upper layers) is transferable within the same dataset, and reports consistent improvements on speech recognition and language modelling tasks.
Contribution
The paper proposes a cascade training approach that trains a network from its upper layers (the classifier) down to its lower layers (the feature extractor), and frames this as a search for high-quality classifiers.
Findings
Improved RNN ASR performance on Wall Street Journal
Enhanced self-attention ASR results on Switchboard
Better AWD-LSTM language model metrics on WikiText-2
Abstract
Although the lower layers of a deep neural network learn features which are transferable across datasets, these layers are not transferable within the same dataset. That is, in general, freezing the trained feature extractor (the lower layers) and retraining the classifier (the upper layers) on the same dataset leads to worse performance. In this paper, for the first time, we show that the frozen classifier is transferable within the same dataset. We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers. We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks. The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
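The abstract does not spell out the algorithm, so the following is only a minimal sketch of one possible reading of "freeze the classifier and retrain the lower layers", not the authors' implementation. It assumes a toy two-layer network in NumPy, where W1 stands in for the feature extractor and W2 for the classifier; the data, layer sizes, learning rate, and all function names are hypothetical. For a deeper network, the same freeze-and-retrain step would presumably be repeated from the top layer downwards.

```python
import numpy as np

def init_lower(rng, n_in, n_hidden):
    # Hypothetical initialiser for the lower layer (feature extractor).
    return rng.normal(0.0, 0.5, (n_in, n_hidden))

def loss_and_grads(params, X, y):
    # Forward pass: tanh feature extractor, then linear classifier.
    H = np.tanh(X @ params["W1"])
    logits = H @ params["W2"]
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    n = X.shape[0]
    loss = -np.log(p[np.arange(n), y]).mean()
    # Backward pass for softmax cross-entropy.
    dlogits = p.copy()
    dlogits[np.arange(n), y] -= 1.0
    dlogits /= n
    grad_W2 = H.T @ dlogits
    dH = dlogits @ params["W2"].T
    grad_W1 = X.T @ (dH * (1.0 - H ** 2))
    return loss, {"W1": grad_W1, "W2": grad_W2}

def train(params, X, y, trainable, steps=200, lr=0.5):
    # Plain gradient descent; only the parameters named in `trainable` move.
    for _ in range(steps):
        _, grads = loss_and_grads(params, X, y)
        for name in trainable:
            params[name] -= lr * grads[name]
    return loss_and_grads(params, X, y)[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # toy labels, purely illustrative

# Stage 1: train the whole network jointly.
params = {"W1": init_lower(rng, 4, 16), "W2": rng.normal(0.0, 0.5, (16, 2))}
loss_joint = train(params, X, y, trainable=["W1", "W2"])

# Stage 2: keep the trained classifier frozen, re-initialise the lower
# layer, and retrain only the lower layer on the same dataset.
frozen_classifier = params["W2"].copy()
params["W1"] = init_lower(rng, 4, 16)
loss_cascade = train(params, X, y, trainable=["W1"])

print(f"joint loss: {loss_joint:.4f}, frozen-classifier loss: {loss_cascade:.4f}")
```

Whether the second stage actually lowers the loss on this toy problem is not guaranteed; the sketch only illustrates the mechanics of keeping the classifier fixed while the lower layers are trained again.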
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Sigmoid Activation · Tanh Activation · Variational Dropout · Dropout · Weight Tying · DropConnect · Long Short-Term Memory · Activation Regularization · Temporal Activation Regularization · Embedding Dropout
