Dropout in Training Neural Networks: Flatness of Solution and Noise Structure

Zhongwang Zhang; Hanxu Zhou; Zhi-Qin John Xu

arXiv:2111.01022 · cs.LG · May 24, 2022 · 1 citation

Open Access

TL;DR

This paper shows that training with dropout leads neural networks to flatter minima and that the covariance of the dropout-induced noise aligns with the Hessian of the loss, which helps explain how dropout improves generalization across architectures and datasets.

Contribution

The work provides empirical evidence and theoretical analysis showing that dropout finds flatter minima and that its noise covariance aligns with the Hessian, which helps explain its effectiveness as a regularizer.

Findings

01

Training with dropout finds flatter minima than standard gradient descent training.

02

The variance of the dropout-induced noise is larger along sharper directions of the loss landscape.

03

The Hessian at the found minima aligns with the covariance matrix of the dropout noise, which may help explain dropout's effectiveness.
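To make the second and third findings concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes a toy two-layer ReLU network with a fixed first layer, dropout applied only to the last hidden layer, and a half mean squared-error loss; all sizes and names (`keep`, `dropout_grad`, `noise_var`, `alignment`) are illustrative assumptions. It estimates the covariance of the dropout gradient noise by Monte Carlo sampling and projects it onto the eigen-directions of the Hessian with respect to the output weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: every size and name here is an assumption made for illustration.
n, d, width, keep = 64, 5, 8, 0.8
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
W = rng.normal(size=(width, d))   # fixed first-layer weights
a = rng.normal(size=width)        # output weights: the parameters studied here
h = np.maximum(X @ W.T, 0.0)      # ReLU features of the last hidden layer, (n, width)


def dropout_grad(mask):
    """Gradient of the loss (1/2n) * sum((f - y)^2) w.r.t. `a` for one mask."""
    hm = h * mask / keep          # inverted dropout on the last hidden layer
    resid = hm @ a - y
    return hm.T @ resid / n


# Hessian of the dropout-free loss w.r.t. `a` (exact for this quadratic loss).
H = h.T @ h / n

# Monte Carlo estimate of the covariance of the dropout gradient noise,
# drawing an independent mask for every sample and hidden unit.
masks = rng.random(size=(2000, n, width)) < keep
grads = np.stack([dropout_grad(m) for m in masks])
Sigma = np.cov(grads, rowvar=False)

# Noise variance along each Hessian eigen-direction: v_j^T Sigma v_j.
eigvals, eigvecs = np.linalg.eigh(H)
noise_var = np.einsum("ij,ik,kj->j", eigvecs, Sigma, eigvecs)

# One simple scalar summary of alignment (an assumed choice): Tr(H Sigma).
alignment = np.trace(H @ Sigma)

print("Hessian eigenvalues:", eigvals)
print("noise variance per eigen-direction:", noise_var)
print("correlation:", np.corrcoef(eigvals, noise_var)[0, 1])
```

Comparing `eigvals` with `noise_var` shows whether the noise is larger along sharper directions in this toy setting; the sketch does not reproduce the paper's experiments, and the result on such a small random problem should not be read as evidence for or against the findings.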

Abstract

It is important to understand how dropout, a popular regularization method, helps neural network training find solutions that generalize well. In this work, we show that training with dropout finds a neural network at a flatter minimum than standard gradient descent training does. Through experiments on various datasets (MNIST, CIFAR-10, CIFAR-100, and Multi30k) and various architectures (fully-connected networks, large residual convolutional networks, and a transformer), we further find that the variance of the noise induced by dropout is larger along sharper directions of the loss landscape, and that the Hessian of the loss landscape at the found minima aligns with the noise covariance matrix. For networks with a piecewise-linear activation function and dropout applied only at the last hidden layer, we then theoretically derive the Hessian and the covariance of dropout…
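One common way to compare the flatness of two minima is to perturb the parameters along random unit directions and record how quickly the loss rises. The sketch below illustrates that idea under stated assumptions: the helper `loss_along_random_directions` and the two quadratic toy losses are hypothetical and are not taken from the paper, and the sketch does not model which training method reaches which minimum.

```python
import numpy as np


def loss_along_random_directions(loss_fn, theta, alphas, n_dirs=20, seed=0):
    """Average loss at theta + alpha * d over random unit directions d."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, theta.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # normalise to unit length
    return np.array([
        np.mean([loss_fn(theta + alpha * d) for d in dirs]) for alpha in alphas
    ])


# Two toy minima at the origin with different curvature (illustrative only).
def flat_loss(th):
    return 0.5 * 1.0 * np.sum(th ** 2)


def sharp_loss(th):
    return 0.5 * 10.0 * np.sum(th ** 2)


theta_star = np.zeros(10)
alphas = np.linspace(-1.0, 1.0, 5)

flat_profile = loss_along_random_directions(flat_loss, theta_star, alphas)
sharp_profile = loss_along_random_directions(sharp_loss, theta_star, alphas)

print("step sizes:   ", alphas)
print("flat minimum: ", flat_profile)
print("sharp minimum:", sharp_profile)
```

A flatter minimum produces a profile that rises more slowly away from `alpha = 0`. For a real network, `loss_fn` would evaluate the training loss at a flattened parameter vector, and the two profiles would come from models trained with and without dropout.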

Taxonomy

Topics: Advanced Neural Network Applications · Neural Networks and Applications · Machine Learning and ELM

Methods: Stochastic Gradient Descent · Dropout