Magnitude Invariant Parametrizations Improve Hypernetwork Learning
Jose Javier Gonzalez Ortiz, John Guttag, Adrian Dalca

TL;DR
This paper identifies a fundamental problem in hypernetwork training, a magnitude proportionality between the hypernetwork's inputs and outputs, and proposes Magnitude Invariant Parametrizations (MIP) to stabilize and accelerate training across a range of tasks.
Contribution
The paper introduces MIP, a simple reformulation that addresses magnitude proportionality issues, improving hypernetwork training stability and convergence speed.
Findings
MIP stabilizes hypernetwork training across multiple tasks.
MIP consistently accelerates convergence in experiments.
The approach is effective with various activation functions and architectures.
Abstract
Hypernetworks, neural networks that predict the parameters of another neural network, are powerful models that have been successfully used in diverse applications from image generation to multi-task learning. Unfortunately, existing hypernetworks are often challenging to train. Training typically converges far more slowly than for non-hypernetwork models, and the rate of convergence can be very sensitive to hyperparameter choices. In this work, we identify a fundamental and previously unidentified problem that contributes to the challenge of training hypernetworks: a magnitude proportionality between the inputs and outputs of the hypernetwork. We demonstrate both analytically and empirically that this can lead to unstable optimization, thereby slowing down convergence, and sometimes even preventing any learning. We present a simple solution to this problem using a revised hypernetwork…
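The proportionality described in the abstract can be illustrated with a toy example. The sketch below is not the authors' code and does not use the HyperLight library. It assumes a hypothetical bias-free, two-layer ReLU hypernetwork with random weights, for which scaling a non-negative scalar input by some factor scales every predicted parameter by the same factor. The `encode` helper is likewise an assumption made for illustration: it shows one possible way to give a scalar input in [0, 1] a constant norm, and the text above does not confirm that MIP uses this exact encoding.

```python
import math
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def make_hypernet(in_dim, hidden, out_dim, seed=0):
    """Random weights for a hypothetical bias-free two-layer hypernetwork."""
    rng = random.Random(seed)
    W1 = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hidden)]
    W2 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(out_dim)]
    return W1, W2


def hypernet(params, gamma):
    """Predict a flat vector of main-network parameters from input gamma."""
    W1, W2 = params
    return matvec(W2, relu(matvec(W1, gamma)))


def l2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))


def encode(gamma):
    """Assumed constant-norm encoding of a scalar in [0, 1] (illustrative)."""
    return [math.cos(math.pi * gamma / 2), math.sin(math.pi * gamma / 2)]


inputs = (0.0, 0.1, 0.5, 1.0)

# Raw scalar input: the norm of the predicted parameters grows linearly with
# the input, and an input of zero predicts all-zero parameters.
raw_params = make_hypernet(in_dim=1, hidden=16, out_dim=8)
raw_norms = [l2(hypernet(raw_params, [g])) for g in inputs]

# Encoded input: every encoded vector has unit norm, so the input magnitude
# seen by the hypernetwork no longer depends on gamma.
enc_params = make_hypernet(in_dim=2, hidden=16, out_dim=8)
enc_input_norms = [l2(encode(g)) for g in inputs]
enc_norms = [l2(hypernet(enc_params, encode(g))) for g in inputs]

for g, r, e in zip(inputs, raw_norms, enc_norms):
    print(f"gamma={g:.1f}  raw output norm={r:.4f}  encoded output norm={e:.4f}")
```

In this toy setting the raw output norm at gamma = 1.0 is ten times the norm at gamma = 0.1, which is the kind of input-dependent scale the abstract links to unstable optimization. With the assumed encoding, the input norm stays at 1 for every gamma.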
Peer Reviews
Decision: ICLR 2024 poster
While normalizing inputs to neural network models is already a well-established best practice, to the best of my knowledge the specific application of this practice to hypernetwork inputs has not been studied as much. I consider the straightforwardness of the authors' input encoding approach a plus. The experimental validation of the paper's key claims is extensive.
I am not really sure if the output encoding part of the framework fits well with the problem that the authors claim to solve. It is not clear how output encoding relates to the input and output proportionality problem. It also makes interpreting the experimental results where both input and output encoding are used harder. Intuitively, output encoding allows the model to learn the task even if the hypernetwork does nothing so it is not clear if we improve hypernetwork training or just make the h
The strengths of this paper include the identification of a novel optimization problem in hypernetwork training, the proposal of a new formulation (MIP) that addresses this issue without extra computational costs, extensive testing and comparative analysis demonstrating MIP's effectiveness, and the provision of an open-source library, HyperLight, to facilitate the practical adoption of the proposed solution in hypernetwork models. Through rigorous analysis and extensive experimentation, the pape
The paper mainly focuses on fully connected layers and common activation functions, initialization choices, and optimizers (SGD with momentum and Adam) in its experiments, which may not encompass a broader spectrum of hypernetwork architectures or other types of networks. There's also a mention of unexplored territories like the effect of MIP on transfer learning and other less common architectures and optimizers, indicating a scope for broader empirical validation. Furthermore, the impact of MIP on real-
- The authors identify a novel issue that seems to be important (based on the improvement delta from MIP) for hypernetwork training.
- The specific MIP parametrisation is novel and practically broadly useful for any task that involves hypernetworks.
- The experiments are relatively extensive in terms of tasks and ablation / robustness studies.
- The paper is mostly well written and clear in the presentation of the main ideas and results.
- While the authors do empirically show that MIP benefits training, it is not clear whether the increased variance could also be controlled with, e.g., appropriately chosen (i.e., lower) learning rates and (i.e., higher) momentum, in the original parametrisation (which could attain similar performance, albeit slower).
- This is something that the authors themselves identify, but given that hypernetworks are becoming popular for fast adaptation of pretrained models, e.g., [1], it is important to s
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Generative Adversarial Networks and Image Synthesis
Methods: Test · HyperNetwork
