Towards Theoretically Inspired Neural Initialization Optimization

Yibo Yang; Hong Wang; Haobo Yuan; Zhouchen Lin

arXiv:2210.05956 · cs.LG · October 13, 2022 · 1 citation

TL;DR

This paper introduces GradCosine, a theoretically motivated metric for neural initialization, and proposes NIO, an automated method to optimize initial weights, improving performance across various architectures and datasets.

Contribution

The paper presents a novel differentiable metric, GradCosine, and an automated initialization optimization algorithm, NIO, for neural networks, reducing reliance on handcrafted initializations.

Findings

1. NIO improves classification accuracy on CIFAR-10, CIFAR-100, and ImageNet.

2. GradCosine correlates with training and test performance.

3. NIO enables training large vision Transformers without warmup.

Abstract

Automated machine learning has been widely explored to reduce human effort in designing neural architectures and searching for proper hyperparameters. In the domain of neural initialization, however, similar automated techniques have rarely been studied. Most existing initialization methods are handcrafted and highly dependent on specific architectures. In this paper, we propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network. Specifically, GradCosine is the cosine similarity of sample-wise gradients with respect to the initialized parameters. By analyzing the sample-wise optimization landscape, we show that both the training and test performance of a network can be improved by maximizing GradCosine under a gradient norm constraint. Based on this observation, we further propose the neural initialization optimization…
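To make the definition in the abstract concrete, here is a minimal sketch of a GradCosine-style score. It is not the authors' implementation: the toy logistic-regression model, the helper names (`per_sample_grad`, `cosine`, `grad_cosine`), and the choice to average the cosine over all sample pairs are assumptions made for illustration. The paper's exact averaging scheme and its gradient norm constraint are not reproduced here.

```python
import math
from itertools import combinations


def per_sample_grad(w, x, y):
    """Gradient of the logistic loss for ONE sample (x, y) w.r.t. weights w.

    Assumption: a bias-free logistic-regression model stands in for the network.
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
    return [(p - y) * xi for xi in x]


def cosine(u, v, eps=1e-12):
    """Cosine similarity of two gradient vectors (eps guards against zero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def grad_cosine(w, samples):
    """Mean pairwise cosine similarity of sample-wise gradients at weights w.

    Assumption: a plain average over all unordered sample pairs.
    """
    grads = [per_sample_grad(w, x, y) for x, y in samples]
    pairs = list(combinations(grads, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy batch: (features, label) pairs.
samples = [([1.0, 0.5], 1), ([0.8, 0.7], 1), ([-1.0, -0.4], 0)]
w0 = [0.0, 0.0]  # candidate initialization
score = grad_cosine(w0, samples)
print(f"GradCosine-style score at w0: {score:.3f}")
```

A score near 1 means the individual samples pull the parameters in nearly the same direction at initialization, while a score near 0 or below means their gradients conflict. An initialization optimizer in the spirit of NIO would adjust the initial weights to raise this score subject to a bound on the gradient norm. That step is omitted from this sketch.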

Code & Models

Repositories

HarborYuan/GradCosine (official)

Videos

Towards Theoretically Inspired Neural Initialization Optimization · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Machine Learning and Data Classification

Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Residual Connection · Dropout