A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

Akhilesh Gotmare; Nitish Shirish Keskar; Caiming Xiong; Richard Socher

arXiv:1810.13243·cs.LG·November 1, 2018·70 cites

TL;DR

This paper examines why common deep learning heuristics, namely learning rate restarts, warmup, and knowledge distillation, are effective, using mode connectivity and canonical correlation analysis (CCA) to study their roles in training dynamics.

Contribution

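It applies two recently proposed analysis tools, mode connectivity for loss surfaces and canonical correlation analysis (CCA) for representations, to examine empirically why heuristics such as learning rate restarts, warmup, and distillation work in deep learning.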
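To make the representation-analysis side concrete, the following is a minimal sketch of a plain mean-CCA similarity between two sets of layer activations. It is illustrative only: the function name `mean_cca_similarity`, the (samples x neurons) input layout, and the use of an unweighted mean are assumptions, and the paper may use a different CCA variant. The sketch also assumes more samples than neurons and full-column-rank activations.

```python
import numpy as np

def mean_cca_similarity(acts_a, acts_b):
    """Mean canonical correlation between two (samples x neurons) activation matrices.

    Hypothetical helper for illustration; assumes samples > neurons and
    full-column-rank inputs.
    """
    # Center each neuron's activations over the samples.
    a = acts_a - acts_a.mean(axis=0, keepdims=True)
    b = acts_b - acts_b.mean(axis=0, keepdims=True)
    # Orthonormal bases for the column spaces of the two activation sets.
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    # The singular values of qa^T qb are the canonical correlations.
    rho = np.linalg.svd(qa.T @ qb, compute_uv=False)
    # Clip tiny numerical overshoots above 1 before averaging.
    return float(np.clip(rho, 0.0, 1.0).mean())

# Usage: CCA is invariant to invertible linear maps of the neurons,
# so a linearly re-mixed copy of a layer scores close to 1.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
mix = rng.normal(size=(5, 5)) + 3.0 * np.eye(5)
same_subspace = mean_cca_similarity(x, x @ mix)
unrelated = mean_cca_similarity(x, rng.normal(size=(200, 5)))
```

A score near 1 indicates that the two layers span nearly the same subspace of responses over the probe inputs, while unrelated activations score much lower.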

Findings

01

The reasons often cited for the success of cosine annealing are not supported by empirical evidence.

02

Learning rate warmup prevents the deeper layers from creating training instability.

03

Latent knowledge from the teacher is mainly transferred to deeper layers.
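For readers unfamiliar with the learning rate heuristics in the first two findings, the following is a minimal sketch of a schedule that combines linear warmup with cosine annealing and periodic restarts. The function name `lr_at_step` and every default value (`base_lr`, `warmup_steps`, `cycle_steps`, `min_lr`) are hypothetical choices for illustration, not settings taken from the paper.

```python
import math

def lr_at_step(step, base_lr=0.1, warmup_steps=500, cycle_steps=2000, min_lr=0.0):
    """Learning rate at a given training step (illustrative values only)."""
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Position inside the current cosine cycle; the modulo implements the restart.
    t = (step - warmup_steps) % cycle_steps
    # Cosine annealing from base_lr down toward min_lr over one cycle.
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_steps))
    return min_lr + (base_lr - min_lr) * cosine

# Usage: sample the schedule at a few steps.
schedule = [lr_at_step(s) for s in (0, 499, 500, 1500, 2499, 2500)]
```

With these defaults, the rate climbs linearly for 500 steps, decays along a half cosine over the next 2000 steps, and then jumps back to `base_lr` at step 2500 to begin a new cycle.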

Abstract

The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit such analysis of heuristics through the lens of recently proposed methods for loss surface and representation analysis, viz., mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons for the success of the heuristics. In particular, we explore knowledge distillation and learning rate…

Taxonomy

Topics: Neural Networks and Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification

Methods: Knowledge Distillation · Cosine Annealing