Sign-SGD via Parameter-Free Optimization
Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov

TL;DR
This paper introduces a parameter-free Sign-SGD optimizer that removes manual stepsize tuning. It trains large models with performance comparable to tuned Sign-SGD and AdamW and reports roughly 1.5x faster training.
Contribution
The paper develops a parameter-free Sign-SGD method for single-node and multi-node training, removing the overhead of stepsize tuning. It also incorporates momentum and offers a memory-efficient variant that stores only gradient signs instead of full gradients.
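To illustrate why storing only gradient signs saves memory, here is a minimal sketch that packs one sign per bit. The helper names `pack_signs` and `unpack_signs` are hypothetical, and mapping a zero gradient to +1 is an assumption made for this sketch; the paper's actual storage scheme may differ.

```python
def pack_signs(grad):
    """Pack the sign of each coordinate into one bit (1 = non-negative, 0 = negative)."""
    packed = bytearray((len(grad) + 7) // 8)
    for i, g in enumerate(grad):
        if g >= 0:  # assumption: zero is stored as +1
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_signs(packed, n):
    """Recover a list of +1/-1 values for the first n coordinates."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(n)]


grad = [0.3, -1.2, 0.0, -0.01, 5.0, -7.0, 2.0, -3.0, 1.0]
packed = pack_signs(grad)          # 9 coordinates fit in 2 bytes
signs = unpack_signs(packed, len(grad))
```

Under these assumptions each coordinate costs 1 bit rather than the 32 bits of a single-precision float.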
Findings
Matches performance of tuned Sign-SGD and AdamW
Achieves approximately 1.5x speedup in training
Effectively trains large language models without stepsize tuning
Abstract
Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it relies on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin…
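The stepsize limitation described in the abstract can be seen in a minimal sketch of plain sign descent with a fixed, hand-picked stepsize. This is not the paper's parameter-free method: the toy quadratic, the names `sign_sgd` and `quad_grad`, and the exponential-average form of momentum are all assumptions chosen for illustration.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_sgd(grad_fn, x0, stepsize, steps, beta=0.0):
    """Sign descent with a fixed stepsize and optional momentum (beta=0 disables it)."""
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x)
        # Momentum as an exponential average of gradients (one common form).
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        # Every coordinate moves by exactly `stepsize`, whatever the gradient magnitude.
        x = [xi - stepsize * sign(mi) for xi, mi in zip(x, m)]
    return x


def quad_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)."""
    return [1.0 * x[0], 10.0 * x[1]]


x_big = sign_sgd(quad_grad, [1.0, -2.0], stepsize=0.1, steps=100)
x_small = sign_sgd(quad_grad, [1.0, -2.0], stepsize=0.001, steps=100)
```

In this toy run the larger stepsize reaches the minimizer quickly but then oscillates within about one stepsize of it, while the smaller stepsize has moved each coordinate by only 0.1 after 100 steps. A good fixed value depends on the starting distance and the problem's scale, which are the kinds of quantities that are not known in advance.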
Peer Reviews
Decision: ICLR 2026 Poster
* The paper tackles an important problem of mitigating hyperparameter tuning of optimizers which is quite relevant in the current era of LLMs.
* The proposed algorithm, guided by theory, is simple to implement and has practical benefits such as lower costs in single-node/distributed settings and no need for LR tuning. Also, kudos to authors for providing a memory-efficient version of the algorithm that is already efficient relative to SOTA optimizers such as AdamW.
* There are a bunch of hyperparameter-free methods that also eliminate the need for manually setting step sizes [1] but no such baselines are presented in this work, why?
* Appendix A.1.1 mentions that only main model params are optimized by ALIAS and LM head params are optimized by AdamW; this detail is quite critical and must be mentioned in the main paper. How is AdamW learning rate selected for LM head params? Is it tuned through grid-search? If yes, then isn’t the point of param-free / sav
1. The authors propose a hyper-parameter-free sign-GD algorithm with its stochastic version and memory-efficient version.
2. For all versions, the authors give a convergence analysis.
3. The experimental results of the stochastic version and memory-efficient version work well on 130M model.
1. It is not clear what algorithm it is in the stochastic version, especially in the calculation of $d_t$.
2. The expression of $\epsilon = xxx$ is confusing. It seems that $\epsilon$ is also related to the $T$ and gradient norm. But according to the proof, it should be $xxx \leq \epsilon$ when $T \leq O(...)$.
3. It is well-known that sign-sgd will not converge. It is unclear how the algorithm overcomes the issue.
1. Novel param-free algorithm for SIGN-SGD: the paper introduces a novel optimizer (ALIAS) that requires no manual step size tuning for SIGN-SGD. This is an original contribution, as prior sign-based methods either fixed a step size or relied on problem-dependent choices. ALIAS uses a clever per-iteration adaptation that accumulates an estimate of the local smoothness and the loss gap $\Delta^*$ to adjust $\gamma_t$ automatically. This approach is param-free in that it doesn’t need line searches
1. Theoretical assumption: the theoretical guarantees are restricted to convex optimization (w/ smoothness assumption) and are expressed in terms of finding stationary points (grad norm). While this is standard for sign-based methods, it means the theory doesn’t directly guarantee improvement on the non-convex training objectives that the experiments address. The authors do discuss why a convex analysis is still informative for sign methods, but the strongest guarantees (Theorems 3.5, 3.9) hold
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Stochastic Gradient Optimization Techniques
