FOSI: Hybrid First and Second Order Optimization
Hadar Sivan, Moshe Gabel, Assaf Schuster

TL;DR
FOSI is a meta-algorithm that enhances first-order optimizers by efficiently integrating second-order information, leading to faster convergence and better performance in machine learning tasks.
Contribution
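FOSI introduces a novel method to incorporate second-order information into any first-order optimizer: in each iteration it implicitly splits the function into two quadratic functions defined on orthogonal subspaces, minimizing the first with a second-order method and the other with the base optimizer.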
Findings
FOSI improves the convergence rate of first-order methods such as Heavy-Ball and Adam.
FOSI outperforms traditional second-order methods such as K-FAC and L-BFGS.
Empirical results show faster optimization times with FOSI.
Abstract
Popular machine learning approaches forgo second-order information due to the difficulty of computing curvature in high dimensions. We present FOSI, a novel meta-algorithm that improves the performance of any base first-order optimizer by efficiently incorporating second-order information during the optimization process. In each iteration, FOSI implicitly splits the function into two quadratic functions defined on orthogonal subspaces, then uses a second-order method to minimize the first, and the base optimizer to minimize the other. We formally analyze FOSI's convergence and the conditions under which it improves a base optimizer. Our empirical evaluation demonstrates that FOSI improves the convergence rate and optimization time of first-order methods such as Heavy-Ball and Adam, and outperforms second-order methods (K-FAC and L-BFGS).
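The split described in the abstract can be illustrated on a toy quadratic. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses a single eigenvector (a subspace of dimension one), power iteration as a simple stand-in for a Lanczos-style eigensolver, plain gradient descent as the base optimizer, and an explicitly known Hessian `H`. All function names (`top_eigenpair`, `hybrid_step`, `quad_value`) are hypothetical and chosen for illustration.

```python
import math

def matvec(H, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quad_value(H, x):
    """Value of the toy objective f(x) = 0.5 * x^T H x."""
    return 0.5 * dot(x, matvec(H, x))

def top_eigenpair(H, iters=200):
    """Estimate the largest eigenvalue and its eigenvector by power iteration.

    Assumption: H is symmetric positive definite and the start vector is not
    orthogonal to the top eigenvector.
    """
    n = len(H)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = matvec(H, v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    lam = dot(v, matvec(H, v))  # Rayleigh quotient of the unit vector v
    return lam, v

def hybrid_step(H, x, lam, v, lr):
    """One hybrid update on f(x) = 0.5 * x^T H x.

    The gradient is split into its component along v and the orthogonal
    remainder. The component along v gets a Newton-style step (scaled by
    1 / lam); the remainder gets a plain gradient-descent step with rate lr.
    """
    g = matvec(H, x)                      # gradient of the quadratic
    g_v = dot(g, v)
    g_par = [g_v * vi for vi in v]        # part of the gradient in span{v}
    g_perp = [gi - pi for gi, pi in zip(g, g_par)]
    return [xi - pi / lam - lr * qi for xi, pi, qi in zip(x, g_par, g_perp)]

if __name__ == "__main__":
    # Ill-conditioned toy problem: one direction is far steeper than the rest.
    H = [[100.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.5]]
    lam, v = top_eigenpair(H)
    x = [1.0, 1.0, 1.0]
    for _ in range(20):
        x = hybrid_step(H, x, lam, v, lr=1.0)
    print("hybrid objective after 20 steps:", quad_value(H, x))
```

In this toy setting, handling the steepest direction separately means the base optimizer's learning rate is limited only by the curvature of the remaining directions, which is why a rate of `1.0` stays stable here even though plain gradient descent on the full problem would need a rate below `2 / 100`.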
Peer Reviews
Decision: ICLR 2024 poster
This submission is well organized, with clear language and structure. The authors give a detailed description of the proposed algorithms along with some theoretical analysis. They also conduct many numerical experiments on deep learning problems, and the empirical results are quite good compared with some state-of-the-art optimization methods. The idea is interesting and suggests promising future directions for the optimization community.
There are some disadvantages to this submission. The authors only give theoretical results for the stochastic optimization problem. What is the convergence rate for the general convex optimization problem? What is the convergence rate in the strongly convex setting? Adding and presenting these theoretical analyses could significantly improve the quality of this submission. It would also be better to move the detailed algorithm from the appendix into the main part of the paper.
The idea of splitting the raw space into two orthogonal subspaces is interesting. The authors adopt the Lanczos algorithm as one possible way to construct these subspaces.
One of my major concerns is that the memory consumption and computational complexity are very high, especially for large-scale neural networks. This will limit the usage of the proposed method. Besides, it is not clear how to handle the communication cost and the computation of $V$ in the distributed setting. The scale of the network architectures used in the numerical experiments is limited. It will be more convincing if the authors can show the effectiveness of the proposed method in larger app
Nice analysis with detailed derivations and explanations. A large number of empirical studies are presented with experimental results. The steps of the algorithm are clearly specified. Enjoyed reading the paper.
A few failure cases could be discussed. Although decomposing the problem into two parts may not in itself be novel, FOSI’s inverse preconditioner seems to be quite a good idea. Similar work along those lines of decomposition could be mentioned.
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Control Systems Optimization · Embedded Systems Design Techniques
Methods: Balanced Selection · Stochastic Gradient Descent · Adam
