Layer rotation: a surprisingly powerful indicator of generalization in deep networks?
Simon Carbonnelle, Christophe De Vleeschouwer

TL;DR
This paper presents empirical evidence that layer rotation, the cosine distance between each layer's weights and their initialization tracked over training, is a strong indicator of generalization in deep networks, with the best performance when all layers reach a cosine distance of 1.
Contribution
The paper introduces layer rotation as a novel, consistent indicator of generalization and shows that it can be monitored and controlled to improve training outcomes.
Findings
Larger layer rotation correlates with better generalization.
Optimal generalization occurs when all layers reach a cosine distance of 1 from their initialization.
Layer rotation can be used to guide hyperparameter tuning.
Abstract
Our work presents extensive empirical evidence that layer rotation, i.e. the evolution across training of the cosine distance between each layer's weight vector and its initialization, constitutes an impressively consistent indicator of generalization performance. In particular, larger cosine distances between final and initial weights of each layer consistently translate into better generalization performance of the final model. Interestingly, this relation admits a network-independent optimum: training procedures during which all layers' weights reach a cosine distance of 1 from their initialization consistently outperform other configurations, by up to 30% test accuracy. Moreover, we show that layer rotations are easily monitored and controlled (helpful for hyperparameter tuning) and potentially provide a unified framework to explain the impact of learning rate tuning, weight decay,…
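The quantity described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes cosine distance means 1 minus cosine similarity, that each layer's weight tensor is flattened into a single vector, and that weights are available as NumPy arrays keyed by layer name. The function names `layer_rotation` and `layer_rotations` are hypothetical.

```python
import numpy as np


def layer_rotation(w_init, w_current):
    """Cosine distance (1 - cosine similarity) between two weight tensors.

    Assumption: each tensor is flattened into one vector per layer.
    """
    a = np.ravel(w_init).astype(np.float64)
    b = np.ravel(w_current).astype(np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero weight vector")
    return 1.0 - float(np.dot(a, b) / denom)


def layer_rotations(init_weights, current_weights):
    """Per-layer rotation for two dicts mapping layer name -> weight array."""
    return {
        name: layer_rotation(init_weights[name], current_weights[name])
        for name in init_weights
    }


# Hypothetical two-layer example: "conv1" has only been rescaled (no rotation),
# while "dense" has become orthogonal to its initialization (distance 1).
init = {"conv1": np.ones((2, 3)), "dense": np.array([1.0, 0.0])}
now = {"conv1": 3.0 * np.ones((2, 3)), "dense": np.array([0.0, 2.0])}
rotations = layer_rotations(init, now)
```

Calling `layer_rotations` on a snapshot saved at initialization and on the weights at the end of each epoch gives one curve per layer. Under the definition assumed here, a value of 1 means the layer's weight vector has become orthogonal to its initialization, and rescaling a layer without changing its direction leaves the value at 0.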
Taxonomy
Topics: Neural Networks and Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
