A Learn-to-Optimize Approach for Coordinate-Wise Step Sizes for Quasi-Newton Methods
Wei Lin, Qingyu Song, Hong Xu

TL;DR
This paper introduces a learn-to-optimize method for coordinate-wise step sizes in quasi-Newton methods, specifically BFGS, using LSTM networks to improve convergence speed while maintaining theoretical guarantees.
Contribution
It provides a theoretical analysis for coordinate-wise step sizes in BFGS and develops a novel L2O approach that learns optimal step sizes with proven convergence properties.
Findings
Achieves up to 4x faster convergence than baseline methods.
Demonstrates effectiveness across diverse optimization tasks.
Provides theoretical conditions ensuring stability and convergence.
Abstract
Tuning step sizes is crucial for the stability and efficiency of optimization algorithms. While adaptive coordinate-wise step sizes have been shown to outperform scalar step sizes in first-order methods, their use in second-order methods is still under-explored and more challenging. Current approaches, including hypergradient descent and cutting plane methods, offer limited improvements or encounter difficulties in second-order contexts. To address these limitations, we first conduct a theoretical analysis within the Broyden-Fletcher-Goldfarb-Shanno (BFGS) framework, a prominent quasi-Newton method, and derive sufficient conditions for coordinate-wise step sizes that ensure convergence and stability. Building on this theoretical foundation, we introduce a novel learn-to-optimize (L2O) method that employs LSTM-based networks to learn optimal step sizes by leveraging insights from past…
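To make the idea of a coordinate-wise step size inside BFGS concrete, here is a minimal sketch. It assumes the update takes the form $x_{k+1} = x_k - P_k H_k \nabla f(x_k)$ with a diagonal $P_k$ and the standard BFGS inverse-Hessian update; the exact form used in the paper is not shown on this page, so treat that as an assumption. The toy quadratic, the function names, and the hand-picked step sizes `p` are all hypothetical and stand in for the values an LSTM would produce.

```python
import numpy as np

def coordinate_wise_step(x, g, H, p):
    """One quasi-Newton step with per-coordinate step sizes.

    Assumed update (not confirmed by the source): x_new = x - P @ H @ g,
    where P = diag(p).
    """
    s = -p * (H @ g)          # elementwise scaling equals diag(p) @ (H @ g)
    return x + s, s

def bfgs_inverse_update(H, s, y):
    """Standard BFGS update of the inverse-Hessian approximation H."""
    rho = 1.0 / (y @ s)       # requires the curvature condition y @ s > 0
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Hypothetical toy problem: f(x) = 0.5 * x^T A x with an ill-conditioned diagonal A.
A = np.diag([1.0, 10.0, 100.0])
grad = lambda x: A @ x

x0 = np.ones(3)
g0 = grad(x0)
H0 = np.eye(3)                # initial inverse-Hessian guess

# Hand-picked step sizes standing in for learned ones (an assumption).
p = np.array([1.0, 0.1, 0.01])
x1, s = coordinate_wise_step(x0, g0, H0, p)
y = grad(x1) - g0
H1 = bfgs_inverse_update(H0, s, y)

# A unit scalar step from the same point, for comparison.
x1_scalar, _ = coordinate_wise_step(x0, g0, H0, np.ones(3))
```

On this toy problem the hand-picked `p` shrinks only the stiff coordinates, so `x1` lands on the minimizer, while the unit scalar step in `x1_scalar` overshoots in the second and third coordinates. The updated `H1` still satisfies the secant equation `H1 @ y == s` up to rounding, because the BFGS update only sees the step actually taken.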
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
**Strength** The paper is overall well-written and easy to follow. The efficiency of the approach is validated by the experiments.
**Weaknesses** I have several concerns regarding the motivation of the proposed approach and the theoretical results.

1. Motivation of BFGS + diagonal stepsize: I find it unnatural to incorporate a diagonal stepsize (preconditioner) into BFGS. The scaling matrix in BFGS already serves as a preconditioner, and layering another preconditioner on top of it seems incremental and not well justified. In particular, the affine invariance property of BFGS seems incompatible with the diagonal stepsize
### 1 Clear motivation

The motivation is reasonable: a single global step size can be overly conservative because it is limited by the most “dangerous” direction. A diagonal matrix $P_k$ lets you shrink only the risky coordinates and keep making larger progress along safe coordinates. The paper even argues (conceptually) that this can produce strictly better objective decrease than using only a scalar line-search step.

### 2 Theory-driven design

The theory section is structured around three g
### 1 Novelty is oversold

The paper repeatedly positions itself as “the first to investigate coordinate-wise step sizes in quasi-Newton (BFGS) with theory + learned policy.” Conceptually, though, what they are doing is extremely close to two well-known ideas:

1. **Diagonal preconditioning / per-parameter scaling.** Scaling each coordinate of the update direction by a learned positive factor is, in spirit, just adaptive diagonal preconditioning. That idea is old in both first-order and quas
The paper lays out the coordinate-wise step-size (CWSS) idea, its integration into BFGS, and the L2O training protocol in a way that is easy to follow. Key design choices—hard clipping, spectral regularisation, and the separation of offline meta-training from online deployment—are all motivated up-front. The convergence‐rate and stability theorems are carefully stated, the assumptions are explicit, and the proofs in the appendix are complete enough to be reproducible.
1. Again, I suggest adding the pseudocode of the proposed method in your appendix, which makes reproduction much simpler.
2. Would learning a single scalar learning rate be possible? Also, there should be a comparison between such a variant and the full coordinate-wise step size approach.
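The scalar versus coordinate-wise comparison raised above can be illustrated on a toy quadratic. The sketch below is not the paper's experiment: it assumes $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$, fixes one search direction `d`, and computes in closed form the best scalar step and the best unconstrained coordinate-wise steps along it. The matrix, starting point, and function names are hypothetical, and the coordinate-wise optimum here is an idealized upper bound rather than what a learned model would output.

```python
import numpy as np

def quadratic(A, b, x):
    """f(x) = 0.5 * x^T A x - b^T x."""
    return 0.5 * x @ A @ x - b @ x

def best_scalar_step(A, g, d):
    """argmin over alpha of f(x - alpha * d) for the quadratic above."""
    return (g @ d) / (d @ A @ d)

def best_coordinate_steps(A, g, d):
    """argmin over p of f(x - diag(d) @ p); needs every entry of d to be nonzero."""
    D = np.diag(d)
    return np.linalg.solve(D @ A @ D, D @ g)

# Hypothetical problem data.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
x = np.array([2.0, -1.0])

g = A @ x - b          # gradient at x
d = g.copy()           # search direction, e.g. H @ g with H = I

alpha = best_scalar_step(A, g, d)
p = best_coordinate_steps(A, g, d)

f_scalar = quadratic(A, b, x - alpha * d)
f_coord = quadratic(A, b, x - p * d)
```

Because a scalar step is the special case where all entries of `p` are equal, `f_coord` can never exceed `f_scalar` in this setting. With no zero entries in `d`, the unconstrained coordinate-wise optimum reaches the exact minimizer in one step, which is why practical schemes constrain the step sizes rather than solve for them directly.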
- Theorems show that with some assumptions, this optimizer will converge to the optimum.
- The optimizer is well-constructed to fit the assumptions under which the optimizer will converge. (See parameterization of $P_k$ on line 344).
- Experiments show improved learning speed on toy tasks.
- The memory requirement of the LSTM operating coordinate-wise is extremely inflated compared to common optimizers like SGD, Adam, and Muon, which also scale linearly in the problem dimensionality but with extremely small multiplicative constant. The memory taken by the optimizer will become a major bottleneck that will really hurt if the optimizer is ever to be used on realistic-scale problems. While the method may be novel, I struggle to see how it is practically useful.
- The experiments are
Taxonomy
Topics: Reservoir Engineering and Simulation Methods · Advanced Numerical Analysis Techniques · Model Reduction and Neural Networks
