Towards Theoretically Inspired Neural Initialization Optimization
Yibo Yang, Hong Wang, Haobo Yuan, Zhouchen Lin

TL;DR
This paper introduces GradCosine, a theoretically motivated metric for neural initialization, and proposes NIO, an automated method to optimize initial weights, improving performance across various architectures and datasets.
Contribution
The paper presents a novel differentiable metric, GradCosine, and an automated initialization optimization algorithm, NIO, for neural networks, reducing reliance on handcrafted initializations.
Findings
NIO improves classification accuracy on CIFAR-10, CIFAR-100, and ImageNet.
GradCosine correlates with training and test performance.
NIO enables training large vision Transformers without warmup.
Abstract
Automated machine learning has been widely explored to reduce human effort in designing neural architectures and searching for proper hyperparameters. In the domain of neural initialization, however, similar automated techniques have rarely been studied. Most existing initialization methods are handcrafted and highly dependent on specific architectures. In this paper, we propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network. Specifically, GradCosine is the cosine similarity of sample-wise gradients with respect to the initialized parameters. By analyzing the sample-wise optimization landscape, we show that both the training and test performance of a network can be improved by maximizing GradCosine under a gradient norm constraint. Based on this observation, we further propose the neural initialization optimization…
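To make the definition of GradCosine concrete, here is a minimal sketch on a toy linear model with squared loss, where per-sample gradients can be written by hand. It is not the authors' implementation. The model, the data, the helper names (`per_sample_gradients`, `cosine`, `grad_cosine`), and the choice to average the cosine over all distinct sample pairs are assumptions made for illustration; the abstract only states that GradCosine is the cosine similarity of sample-wise gradients at initialization. The gradient norm constraint mentioned in the abstract is not modeled here.

```python
import math
from itertools import combinations


def per_sample_gradients(weights, inputs, targets):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w, one vector per sample."""
    grads = []
    for x, y in zip(inputs, targets):
        residual = sum(w * xi for w, xi in zip(weights, x)) - y
        grads.append([residual * xi for xi in x])
    return grads


def cosine(u, v, eps=1e-12):
    """Cosine similarity of two vectors; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def grad_cosine(grads):
    """Mean pairwise cosine similarity of per-sample gradients.

    Averaging over all distinct pairs is an assumption of this sketch.
    """
    pairs = list(combinations(grads, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy data: three samples, two features.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [1.0, 1.0, 2.0]

# Two candidate initial weight vectors to compare.
aligned_init = [0.0, 0.0]
mixed_init = [3.0, 0.0]

score_aligned = grad_cosine(per_sample_gradients(aligned_init, inputs, targets))
score_mixed = grad_cosine(per_sample_gradients(mixed_init, inputs, targets))
print(score_aligned, score_mixed)
```

In this toy setting, the initialization whose per-sample gradients point in more similar directions receives the higher score, which is the property the abstract proposes to maximize. For a real network, the per-sample gradients would come from automatic differentiation rather than a closed-form expression.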
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Residual Connection · Dropout
