Expected Gradients of Maxout Networks and Consequences to Parameter Initialization
Hanna Tseran, Guido Montúfar

TL;DR
This paper analyzes the gradients of maxout networks to develop initialization strategies that prevent vanishing or exploding gradients, improving training stability and efficiency in deep networks.
Contribution
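It provides bounds on the moments of the gradients of maxout networks with respect to inputs and parameters, shows that the distribution of the input-output Jacobian depends on the input, and proposes initialization strategies for maxout networks that avoid vanishing and exploding gradients in wide networks.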
Findings
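The proposed initialization improves SGD and Adam training of deep fully-connected and convolutional maxout networks.
Refined bounds on the expected number of linear regions, with further results on the expected curve length distortion and the NTK.
The distribution of the input-output Jacobian depends on the input, which complicates stable parameter initialization.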
Abstract
We study the gradients of a maxout network with respect to inputs and parameters and obtain bounds for the moments depending on the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates a stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully-connected and convolutional networks show that this strategy improves SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the NTK.
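To make the role of the weight variance concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes a fully-connected maxout network with zero biases and i.i.d. Gaussian weights of variance c / fan_in. The function names `maxout_jvp` and `mean_sq_jvp`, the default sizes, and the values of c are illustrative assumptions, not the constants derived in the paper.

```python
import numpy as np


def maxout_jvp(x, u, widths, rank, c, rng):
    """Push an input x and a direction u through a random maxout network.

    Each unit takes the max over `rank` linear pre-activations.
    Weights are drawn i.i.d. from N(0, c / fan_in); biases are zero (assumption).
    Returns the network output and the Jacobian-vector product J(x) @ u.
    """
    h, v = x, u
    for n_out in widths:
        fan_in = h.shape[0]
        W = rng.normal(0.0, np.sqrt(c / fan_in), size=(rank, n_out, fan_in))
        pre = W @ h                    # shape (rank, n_out)
        winners = pre.argmax(axis=0)   # active linear piece of each unit
        cols = np.arange(n_out)
        W_active = W[winners, cols]    # shape (n_out, fan_in)
        h = pre[winners, cols]         # maxout activation
        v = W_active @ v               # the derivative follows the active piece
    return h, v


def mean_sq_jvp(c, depth=20, width=50, rank=5, n_in=10, trials=50, seed=0):
    """Monte Carlo estimate of E ||J(x) u||^2 for a random unit direction u."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        x = rng.normal(size=n_in)
        u = rng.normal(size=n_in)
        u /= np.linalg.norm(u)
        _, v = maxout_jvp(x, u, [width] * depth, rank, c, rng)
        vals.append(float(v @ v))
    return float(np.mean(vals))


if __name__ == "__main__":
    # Illustrative values of c only; compare how the estimate changes with c.
    for c in (0.3, 0.6, 1.2):
        print(c, mean_sq_jvp(c))
```

Because the biases are zero, the network is positively homogeneous in the weights, so for a fixed seed multiplying c by a factor s multiplies the squared Jacobian-vector product by s^depth. A small mis-scaling of the variance therefore compounds with depth into vanishing or exploding gradients, which is the failure mode the initialization strategies in the abstract aim to avoid.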
Taxonomy
Topics: Brain Tumor Detection and Classification · Machine Learning and ELM · Advanced Neural Network Applications
Methods: Stochastic Gradient Descent · Adam · Neural Tangent Kernel · Maxout
