Glocal Hypergradient Estimation with Koopman Operator
Ryuichiro Hataya, Yoshinobu Kawahara

TL;DR
This paper introduces a novel hypergradient estimation method called glocal, which combines the reliability of global hypergradients with the efficiency of local hypergradients using Koopman operator theory to linearize hypergradient dynamics.
Contribution
We propose a new glocal hypergradient estimation method that leverages Koopman operator theory to efficiently approximate global hypergradients from local hypergradient trajectories.
Findings
Glocal hypergradient estimation achieves reliable hyperparameter optimization.
The method demonstrates efficiency comparable to local methods.
Numerical experiments validate the effectiveness of the approach.
Abstract
Gradient-based hyperparameter optimization methods update hyperparameters using hypergradients, gradients of a meta criterion with respect to hyperparameters. Previous research used two distinct update strategies: optimizing hyperparameters using global hypergradients obtained after completing model training or local hypergradients derived after every few model updates. While global hypergradients offer reliability, their computational cost is significant; conversely, local hypergradients provide speed but are often suboptimal. In this paper, we propose *glocal* hypergradient estimation, blending "global" quality with "local" efficiency. To this end, we use Koopman operator theory to linearize the dynamics of hypergradients so that the global hypergradients can be efficiently approximated using only a trajectory of local hypergradients. Consequently, we can optimize…
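The idea of approximating a long-horizon quantity from a short trajectory by fitting a linear operator can be illustrated with a minimal sketch. This is plain least-squares dynamic mode decomposition (DMD) on a synthetic trajectory, not the authors' algorithm: the matrix `A`, the starting vector `x0`, the step counts, and the helper names `fit_dmd_operator` and `extrapolate` are all hypothetical, and the trajectory merely stands in for a sequence of local hypergradients.

```python
import numpy as np


def fit_dmd_operator(trajectory):
    """Least-squares fit of K such that x_{t+1} ~= K x_t.

    trajectory: array of shape (T + 1, d), one snapshot per row.
    """
    X = trajectory[:-1].T  # snapshots x_0 .. x_{T-1}, shape (d, T)
    Y = trajectory[1:].T   # snapshots x_1 .. x_T,     shape (d, T)
    return Y @ np.linalg.pinv(X)


def extrapolate(K, x, steps):
    """Advance the state x by `steps` applications of the fitted operator."""
    return np.linalg.matrix_power(K, steps) @ x


# Hypothetical stand-in for hypergradient dynamics: a stable linear map.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
x0 = np.array([1.0, -2.0])

# Observe only a short "local" trajectory of 10 steps.
snapshots = [x0]
for _ in range(10):
    snapshots.append(A @ snapshots[-1])
traj = np.array(snapshots)

# Fit the operator from the short trajectory, then jump 40 steps ahead.
K = fit_dmd_operator(traj)
pred = extrapolate(K, traj[-1], 40)

# Reference value obtained by actually running all 50 steps.
true = np.linalg.matrix_power(A, 50) @ x0
print(np.max(np.abs(pred - true)))
```

Because the synthetic dynamics here are exactly linear, the fitted operator recovers `A` up to numerical error. For nonlinear dynamics the fit would be applied to a set of observables of the state rather than to the state itself, and the quality of the extrapolation would depend on that choice.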
Peer Reviews
Decision · Submitted to ICLR 2025
The paper studies an important problem, since hyperparameter optimization is a common challenge in training neural nets. The paper's approach using Koopman operator theory is clever and connects hyperparameter optimization to nonlinear dynamical systems. The algorithm and the computational complexities are clearly written, and I appreciate the diagnostic plots in the experimental section.
1. Important design choices are not given: (1) how should we select the dimension of the Koopman operator $n$? Intuitively, $n$ should depend on properties of the underlying dynamical system, and it would be helpful to have some guidelines. (2) how should we select $\textbf{g}$? Authors use Hankel DMD in the experiments, and it would be great to provide some justification. 2. Experiments: (1) the experiments are relatively small scale. Authors mention that global hypergradients are difficult t
1. The integration of Koopman operator theory to enhance hypergradient estimation is a novel approach, offering a fresh perspective on hyperparameter optimization. 2. The method significantly reduces computational costs compared to traditional global hypergradient approaches. 3. The approach is scalable to large-scale problems, making it applicable to real-world deep learning tasks. Furthermore, the paper provides numerical experiments demonstrating the method's effectiveness in various scenar
1. Algorithm 1 and Theorem 3.1 rely on assumptions about the spectral radius and stability, which may not hold in all cases. 2. The theoretical foundation involving Koopman operators may be complex for practitioners unfamiliar with the concept. 3. The experiments are somewhat limited. Could additional datasets be included, or could comparative experiments be conducted on other models as well? 4. The presentation of the experimental results is somewhat unclear. For example, all the experimenta
I'm on the fence on this submission -- the method is well-motivated and the derivation is for the most part clear, but I felt that the experimental results are somewhat of a let down. * The paper is overall well-written, with a few minor typos. The authors do an admirable job of making their theory tractable and easy to read. * Meta optimization is an important problem and the research setting is well-motivated. * The authors provide a thorough runtime comparison and associated discussion, whic
1. I'm somewhat suspicious of the handwaving around non-unit eigenvalues. Specifically, consider the section from lines 246-252, where basically all eigenvalues besides those which are equal to one are discarded. Is there any theoretically grounded explanation for doing so? If the Koopman operator says that the global hypergradients should oscillate, why intervene and artificially eliminate those modes? Similarly, when solving the DMD for $K$ as in (9), I don't see why you would get modes wit
Taxonomy
Topics: Image and Signal Denoising Methods · Model Reduction and Neural Networks · Advanced Image Processing Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
