Improved Analysis for Sign-based Methods with Momentum Updates
Wei Jiang, Dingzhi Yu, Sifan Yang, Wenhao Yang, Lijun Zhang

TL;DR
This paper provides improved theoretical analysis for sign-based optimization algorithms with momentum, demonstrating better convergence rates under standard conditions and in distributed settings, validated by numerical experiments.
Contribution
It introduces a refined analysis showing signSGD with momentum achieves optimal convergence rates without large batch sizes or special noise assumptions.
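As a rough illustration of the kind of update being analyzed, the following is a minimal sketch of signSGD with momentum (a Signum-style update) using one stochastic gradient per step. The names `signum_minimize` and `noisy_quadratic_grad`, the toy quadratic objective, and the constant values of `lr` and `beta` are assumptions made for this example; the paper's prescribed step-size and momentum schedules are not reproduced here.

```python
import numpy as np

def signum_minimize(grad_fn, x0, steps=1000, lr=0.02, beta=0.1, seed=0):
    """Sketch of signSGD with momentum; grad_fn(x, rng) is a hypothetical interface
    that returns one stochastic gradient at x (constant batch size of one)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = grad_fn(x, rng)                      # initialize the momentum with one stochastic gradient
    for _ in range(steps):
        x = x - lr * np.sign(v)              # move along the sign of the momentum only
        g = grad_fn(x, rng)                  # fresh stochastic gradient at the new iterate
        v = (1.0 - beta) * v + beta * g      # exponential moving average of gradients
    return x

def noisy_quadratic_grad(x, rng, noise_std=1.0):
    # Toy objective (assumption): f(x) = 0.5 * ||x||^2, so the exact gradient is x.
    return x + noise_std * rng.standard_normal(x.shape)

x0 = np.full(10, 5.0)
x_final = signum_minimize(noisy_quadratic_grad, x0)
print(np.linalg.norm(x0), np.linalg.norm(x_final))
```

Because only the sign of the momentum is used, every coordinate moves by exactly `lr` per step (or stays put when the momentum entry is zero), which is what makes the update cheap to communicate.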
Findings
SignSGD with momentum attains $\mathcal{O}(T^{-1/4})$ convergence rate with constant batch sizes.
The analysis improves convergence bounds by a factor of $\mathcal{O}(d^{1/2})$ under $\ell_2$-smoothness.
Distributed sign-based methods with momentum outperform previous algorithms in convergence speed.
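For the distributed setting, a minimal sketch of classical coordinate-wise majority vote over worker momenta may help fix ideas. The function name `majority_vote` and the example vectors are assumptions for illustration. The reviews below mention that the paper introduces a server-side sign operation of its own, which this sketch does not attempt to reproduce.

```python
import numpy as np

def majority_vote(worker_momenta):
    """Classical majority vote: each worker sends one sign per coordinate,
    and the server returns the sign of the coordinate-wise sum (0 on ties)."""
    signs = np.sign(np.stack(worker_momenta))  # shape: (num_workers, dimension)
    return np.sign(signs.sum(axis=0))

# Three hypothetical workers, each holding a local momentum vector.
momenta = [
    np.array([0.3, -1.2, 0.5]),
    np.array([0.1, -0.4, -0.9]),
    np.array([-0.2, -0.7, 0.8]),
]
vote = majority_vote(momenta)
print(vote)
```

Each worker transmits only signs, so the uplink cost is one bit per coordinate under this scheme.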
Abstract
In this paper, we present enhanced analysis for sign-based optimization algorithms with momentum updates. Traditional sign-based methods, under the separable smoothness assumption, guarantee a convergence rate of $\mathcal{O}(T^{-1/4})$, but they either require large batch sizes or assume unimodal symmetric stochastic noise. To address these limitations, we demonstrate that signSGD with momentum can achieve the same convergence rate using constant batch sizes without additional assumptions. Our analysis, under the standard $\ell_2$-smoothness condition, improves upon the result of the prior momentum-based signSGD method by a factor of $\mathcal{O}(d^{1/2})$, where $d$ is the problem dimension. Furthermore, we explore sign-based methods with majority vote in distributed settings and show that the proposed momentum-based method yields convergence rates of $\mathcal{O}\left( d^{1/2}T^{-1/2} +…
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper shows that their algorithm reduces the large-batch requirements and improves the dimension dependency in the theory of sign-based optimization.
2. The analysis is conducted under multiple standard assumptions, making the results robust and widely applicable, and the distributed analysis also accounts for heterogeneous data settings.
3. The experimental results demonstrate superior performance in both centralized and distributed environments.
1. It remains unclear how the proposed algorithm compares with Ref. [1]. The method in Ref. [1] uses a fixed mini-batch size and imposes no noise assumptions, yet achieves an $O(T^{-1/3})$ complexity, whereas the present work reports only $O(T^{-1/4})$. Although the authors note that Ref. [1] assumes component-wise smoothness while this paper assumes global smoothness, both assumptions appear mild.
2. While the theory significantly improves the dependence on the dimension $d$, the experiments do no
1. The paper delivers a theoretical tightening for sign-based momentum methods in both centralized and distributed settings. On the centralized side, it attains the classical non-convex rate $O(T^{-1/4})$ for Signum under separable smoothness and bounded noise without resorting to large-batch or restrictive noise-shape assumptions, and under standard $\ell_2$-smoothness it improves the dimension dependence from $d$ to $d^{1/2}$. On the distributed side, introducing an unbiased server-side sign o
1. The empirical scope is narrow: results focus on CIFAR-10 (centralized) and CIFAR-100 with eight nodes (distributed), leaving open how the methods behave in larger-scale, highly heterogeneous, or bandwidth-constrained regimes.
2. The theory relies on specific schedules for step size and momentum, yet the experiments use grid-tuned constants without ablations that test sensitivity to the prescribed schedules, which blurs the link between bounds and practice.
3. The comparative breadth could
Original tighter error analysis: Bounds $\sum_i |[\nabla f(x_t)]_i| \cdot \mathbb{P}(\text{sign mismatch})$ directly by $\|\nabla f(x_t) - v_t\|_1$, instead of using probability inequalities that require $\mathcal{O}(\sqrt{T})$ batch sizes or symmetric noise assumptions.
Removes assumptions: Achieves the $\mathcal{O}(T^{-1/4})$ rate without the $\mathcal{O}(\sqrt{T})$ batch sizes or symmetric noise assumptions that prior analyses require.
Experimental validation: Experiments with ResNet on CIFAR dataset show faster gradient norm d
Incremental improvements: The improvements are incremental rather than paradigm-shifting. The improved bound is useful but may not have a large practical impact.
Weak experimental results: The experiments use ResNet on CIFAR datasets, which do not reflect modern use cases of sign-based methods. Experimental results on large models would make the paper stronger.
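The coordinate-wise observation behind the tighter error bound mentioned in the review above can be sketched as follows. This is a reconstruction from the reviewer's description, not the paper's own derivation. The indicator notation $\mathbf{1}\{\cdot\}$ is an assumption, and $v_t$ denotes the momentum estimate as in the review. The key point is that when the signs of $[\nabla f(x_t)]_i$ and $[v_t]_i$ disagree, the two values lie on opposite sides of zero (or $[v_t]_i = 0$), so their distance is at least $|[\nabla f(x_t)]_i|$.

```latex
% Pointwise bound for every coordinate i:
\[
\bigl|[\nabla f(x_t)]_i\bigr| \cdot
\mathbf{1}\bigl\{\operatorname{sign}([\nabla f(x_t)]_i) \neq \operatorname{sign}([v_t]_i)\bigr\}
\;\le\; \bigl|[\nabla f(x_t)]_i - [v_t]_i\bigr| .
\]
% Summing over the d coordinates gives an l1 estimation error:
\[
\sum_{i=1}^{d} \bigl|[\nabla f(x_t)]_i\bigr| \cdot
\mathbf{1}\bigl\{\operatorname{sign}([\nabla f(x_t)]_i) \neq \operatorname{sign}([v_t]_i)\bigr\}
\;\le\; \bigl\|\nabla f(x_t) - v_t\bigr\|_1 .
\]
```

The right-hand side depends only on how well the momentum tracks the true gradient, which is why no assumption on the shape of the noise distribution enters at this step.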
Taxonomy
Topics: Model Reduction and Neural Networks
