A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation
Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

TL;DR
This paper examines common deep learning heuristics, such as learning rate schedules and knowledge distillation, using mode connectivity and canonical correlation analysis (CCA), and offers new hypotheses about their roles in training dynamics.
Contribution
It introduces a novel empirical analysis approach using mode connectivity and CCA to explain why heuristics like warmup and distillation work in deep learning.
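To make the CCA part of this contribution concrete, here is a minimal sketch of computing canonical correlations between two sets of layer activations. It is an illustration only, not the authors' implementation: the function name `cca_correlations`, the random test data, and the QR-plus-SVD formulation are assumptions, and the sketch assumes both activation matrices have full column rank.

```python
import numpy as np

def cca_correlations(X, Y):
    """Canonical correlations between activations X (n, d1) and Y (n, d2).

    Hypothetical helper for illustration; assumes full column rank.
    """
    # Center each feature so constant offsets do not affect correlations.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered activations.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles
    # between the two subspaces, i.e. the canonical correlations.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # stand-in for one layer's activations
A = rng.normal(size=(4, 4))            # random (almost surely invertible) map
same_subspace = cca_correlations(X, X @ A)
unrelated = cca_correlations(X, rng.normal(size=(200, 4)))
```

Because CCA is invariant to invertible linear transformations, `same_subspace` should be close to 1 in every component, while `unrelated` should be much smaller. This invariance is what makes CCA useful for comparing representations across networks or training stages.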
Findings
The reasons often quoted for the success of cosine annealing are not evidenced in practice.
Learning rate warmup prevents the deeper layers from creating training instability.
Latent knowledge from the teacher is mainly transferred to deeper layers.
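For readers unfamiliar with the schedules these findings refer to, the following is a minimal sketch of linear warmup followed by cosine annealing with restarts. The function name `lr_at_step` and all default values (`base_lr`, `warmup_steps`, `cycle_steps`, `min_lr`) are assumptions chosen for illustration, not settings taken from the paper.

```python
import math

def lr_at_step(step, base_lr=0.1, warmup_steps=5, cycle_steps=20, min_lr=0.0):
    """Learning rate with linear warmup, then cosine annealing with restarts.

    Hypothetical helper; all defaults are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Position inside the current cosine cycle; the schedule restarts
    # at base_lr every cycle_steps steps.
    t = (step - warmup_steps) % cycle_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_steps))
    return min_lr + (base_lr - min_lr) * cosine

# One warmup phase plus two full cosine cycles.
schedule = [lr_at_step(s) for s in range(45)]
```

With these defaults the rate climbs to `base_lr` over the first five steps, decays toward `min_lr` over twenty steps, and then jumps back to `base_lr` at the restart.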
Abstract
The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit such analysis of heuristics through the lens of recently proposed methods for loss surface and representation analysis, viz., mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons for the success of the heuristics. In particular, we explore knowledge distillation and learning rate…
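As background for the knowledge distillation mentioned in the abstract, here is a minimal sketch of a common soft-target distillation term: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. This is the widely used formulation and is assumed here for illustration; the paper's exact training objective is not given in this summary. The function names and the default temperature are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper; the temperature default is an assumption.
    """
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The loss is zero when the student reproduces the teacher's logits and positive otherwise. In practice this term is usually combined with the ordinary cross-entropy on the true labels.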
Taxonomy
Topics: Neural Networks and Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
Methods: Knowledge Distillation · Cosine Annealing
