Improved Analysis for Sign-based Methods with Momentum Updates

Wei Jiang; Dingzhi Yu; Sifan Yang; Wenhao Yang; Lijun Zhang

arXiv:2507.12091·math.OC·July 17, 2025

Open Access·3 Reviews

TL;DR

This paper provides an improved theoretical analysis of sign-based optimization algorithms with momentum, showing better convergence rates under standard smoothness conditions and in distributed settings, supported by numerical experiments.

Contribution

It introduces a refined analysis showing that signSGD with momentum achieves the $O(T^{-1/4})$ convergence rate with constant batch sizes and without assuming unimodal symmetric noise.

Findings

01

SignSGD with momentum attains an $O(T^{-1/4})$ convergence rate with constant batch sizes.

02

The analysis improves convergence bounds by a factor of $O(d^{1/2})$ under $l_2$-smoothness.

03

Distributed sign-based methods with momentum outperform previous algorithms in convergence speed.

Abstract

In this paper, we present enhanced analysis for sign-based optimization algorithms with momentum updates. Traditional sign-based methods, under the separable smoothness assumption, guarantee a convergence rate of $O(T^{-1/4})$, but they either require large batch sizes or assume unimodal symmetric stochastic noise. To address these limitations, we demonstrate that signSGD with momentum can achieve the same convergence rate using constant batch sizes without additional assumptions. Our analysis, under the standard $l_{2}$-smoothness condition, improves upon the result of the prior momentum-based signSGD method by a factor of $O(d^{1/2})$, where $d$ is the problem dimension. Furthermore, we explore sign-based methods with majority vote in distributed settings and show that the proposed momentum-based method yields convergence rates of $\mathcal{O}\left( d^{1/2}T^{-1/2} +…
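As a rough illustration of the kind of update the abstract refers to, the following is a minimal sketch of signSGD with momentum and of a plain majority vote over worker signs. It is not the paper's algorithm or its prescribed step-size and momentum schedule: the exponential-moving-average form of the momentum, the constant `lr` and `beta`, the toy noisy quadratic objective, and the names `signsgd_momentum`, `majority_vote`, and `noisy_quadratic_grad` are all assumptions made for this example.

```python
import numpy as np


def signsgd_momentum(grad_fn, x0, steps=2000, lr=1e-2, beta=0.9, seed=0):
    """Sign-based descent driven by a momentum estimate of the gradient.

    Assumed update (a common form, not taken from the paper):
        v <- beta * v + (1 - beta) * g      # g is one stochastic gradient
        x <- x - lr * sign(v)
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x, rng)              # constant batch size: one noisy gradient per step
        v = beta * v + (1.0 - beta) * g  # momentum estimate of the gradient
        x = x - lr * np.sign(v)          # only the sign of the estimate is used
    return x


def majority_vote(sign_vectors):
    """Coordinate-wise majority vote over workers' sign vectors (ties give 0)."""
    return np.sign(np.sum(sign_vectors, axis=0))


def noisy_quadratic_grad(x, rng, noise_std=1.0):
    """Stochastic gradient of the toy objective f(x) = 0.5 * ||x||^2."""
    return x + noise_std * rng.standard_normal(x.shape)


if __name__ == "__main__":
    x0 = np.full(10, 5.0)
    x_final = signsgd_momentum(noisy_quadratic_grad, x0)
    print("initial gradient norm:", np.linalg.norm(x0))
    print("final gradient norm:  ", np.linalg.norm(x_final))
```

In a distributed variant, each worker would send the sign of its own momentum estimate and the server would combine them with something like `majority_vote`; how the paper's server-side step differs from this plain vote is not recoverable from the text here.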

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. The paper shows that their algorithm reduces the large-batch requirements and improves the dimension dependency in the theory of sign-based optimization.

2. The analysis is conducted under multiple standard assumptions, making the results robust and widely applicable, and the distributed analysis also accounts for heterogeneous data settings.

3. The experimental results demonstrate superior performance in both centralized and distributed environments.

Weaknesses

1. It remains unclear how the proposed algorithm compares with Ref. [1]. The method in Ref. [1] uses a fixed mini-batch size and imposes no noise assumptions, yet achieves an $O(T^{-1/3})$ complexity, whereas the present work reports only $O(T^{-1/4})$. Although the authors note that Ref. [1] assumes component-wise smoothness while this paper assumes global smoothness, both assumptions appear mild.

2. While the theory significantly improves the dependence on the dimension $d$, the experiments do no

Reviewer 02·Rating 6·Confidence 2

Strengths

1. The paper delivers a theoretical tightening for sign-based momentum methods in both centralized and distributed settings. On the centralized side, it attains the classical non-convex rate $O(T^{-1/4})$ for Signum under separable smoothness and bounded noise without resorting to large-batch or restrictive noise-shape assumptions, and under standard $\ell_2$-smoothness it improves the dimension dependence from $d$ to $d^{1/2}$. On the distributed side, introducing an unbiased server-side sign o

Weaknesses

1. The empirical scope is narrow: results focus on CIFAR-10 (centralized) and CIFAR-100 with eight nodes (distributed), leaving open how the methods behave in larger-scale, highly heterogeneous, or bandwidth-constrained regimes.

2. The theory relies on specific schedules for step size and momentum, yet the experiments use grid-tuned constants without ablations that test sensitivity to the prescribed schedules, which blurs the link between bounds and practice.

3. The comparative breadth could

Reviewer 03·Rating 6·Confidence 2

Strengths

Original tighter error analysis: Bounds $\sum_i |[\nabla f(x_t)]_i| \cdot \mathbb{P}(\text{sign mismatch})$ directly by $\|\nabla f(x_t) - v_t\|_1$ instead of probability inequalities requiring $\mathcal{O}(\sqrt{T})$ batches or symmetric noise assumptions.

Removes assumptions: No $\mathcal{O}(\sqrt{T})$ batches or symmetric noise for $\mathcal{O}(T^{-1/4})$ rate that are required for prior analysis.

Experimental validation: Experiments with ResNet on CIFAR dataset show faster gradient norm d
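One way to read the error-analysis point in this review is through an elementary coordinate-wise inequality. The sketch below assumes $v_t$ denotes the momentum estimate of the gradient and uses $\mathbb{1}\{\cdot\}$ for the indicator function; it illustrates the idea and is not necessarily the paper's exact argument.

$$
\sum_{i=1}^{d} \big|[\nabla f(x_t)]_i\big| \cdot \mathbb{1}\big\{\operatorname{sign}([v_t]_i) \neq \operatorname{sign}([\nabla f(x_t)]_i)\big\}
\;\le\; \sum_{i=1}^{d} \big|[\nabla f(x_t)]_i - [v_t]_i\big|
\;=\; \|\nabla f(x_t) - v_t\|_1 .
$$

The inequality holds term by term: when the two signs differ and $[\nabla f(x_t)]_i \neq 0$, the entry $[v_t]_i$ is either zero or lies on the opposite side of zero, so $|[\nabla f(x_t)]_i - [v_t]_i| \ge |[\nabla f(x_t)]_i|$; when $[\nabla f(x_t)]_i = 0$ the left-hand term vanishes. Taking the expectation over the randomness in $v_t$ with $x_t$ fixed turns each indicator into a mismatch probability, so the probability-weighted sum is controlled by $\mathbb{E}\,\|\nabla f(x_t) - v_t\|_1$ with no assumption on the shape of the noise.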

Weaknesses

Incremental improvements: The improvements are incremental rather than paradigm-shifting. The improved bound is useful but may not have a large practical impact.

Weak experimental results: The experiments use ResNet on CIFAR datasets, which do not reflect modern use cases of sign-based methods. Experimental results on large models would make the paper stronger.

Taxonomy

Topics: Model Reduction and Neural Networks