Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy

Rustem Islamov; Samuel Horvath; Aurelien Lucchi; Peter Richtarik; Eduard Gorbunov

arXiv:2502.11682·cs.LG·March 6, 2026

Open Access·3 Reviews

TL;DR

This paper introduces Clip21-SGD2M, a federated learning algorithm that combines clipping, heavy-ball momentum, and error feedback. For smooth non-convex problems with arbitrarily heterogeneous client data, it attains an optimal convergence rate together with a near-optimal local differential privacy guarantee.

Contribution

The paper proposes Clip21-SGD2M, the first method to combine an optimal convergence rate with near-optimal (local) differential privacy guarantees in federated learning with arbitrarily heterogeneous data, without assuming bounded gradients or bounded data heterogeneity.

Findings

1. Outperforms baselines in non-convex logistic regression
2. Achieves near-optimal differential privacy guarantees
3. Demonstrates superior optimization performance in neural network training

Abstract

Strong Differential Privacy (DP) and Optimization guarantees are two desirable properties for a method in Federated Learning (FL). However, existing algorithms do not achieve both properties at once: they either have optimal DP guarantees but rely on restrictive assumptions such as bounded gradients/bounded data heterogeneity, or they ensure strong optimization performance but lack DP guarantees. To address this gap in the literature, we propose and analyze a new method called Clip21-SGD2M based on a novel combination of clipping, heavy-ball momentum, and Error Feedback. In particular, for non-convex smooth distributed problems with clients having arbitrarily heterogeneous data, we prove that Clip21-SGD2M has optimal convergence rate and also near optimal (local-)DP neighborhood. Our numerical experiments on non-convex logistic regression and training of neural networks highlight the…
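The way clipping, heavy-ball momentum, and error feedback can fit together is easier to see in code. The following is a minimal sketch under stated assumptions, not the paper's exact Clip21-SGD2M update: the function names (`clip`, `run_sketch`), the per-client gradient interface, the hyperparameter values, and the placement of the Gaussian noise are all hypothetical choices made for illustration, and `sigma` is not calibrated to any privacy budget.

```python
import numpy as np

def clip(v, tau):
    """Rescale v so its Euclidean norm is at most tau; leave it unchanged otherwise."""
    norm = np.linalg.norm(v)
    return v if norm <= tau else v * (tau / norm)

def run_sketch(client_grads, x0, steps=500, lr=0.1, beta=0.5, tau=1.0, sigma=0.0, seed=0):
    """Illustrative clipped error-feedback loop with client-side momentum.

    client_grads: list of functions x -> gradient of that client's loss
                  (hypothetical interface, one function per client).
    sigma:        standard deviation of Gaussian noise added to each message
                  (assumed placement; 0.0 disables the noise).
    """
    rng = np.random.default_rng(seed)
    n = len(client_grads)
    x = np.array(x0, dtype=float)
    v = [np.zeros_like(x) for _ in range(n)]        # momentum buffers, one per client
    g_local = [np.zeros_like(x) for _ in range(n)]  # error-feedback state, one per client
    g = np.zeros_like(x)                            # server-side gradient estimate
    for _ in range(steps):
        x = x - lr * g                              # server step with the current estimate
        messages = []
        for i in range(n):
            # Heavy-ball style averaging of the client's gradients.
            v[i] = (1 - beta) * v[i] + beta * client_grads[i](x)
            # Error feedback: clip the *difference* to the running estimate,
            # so what clipping cuts off now is still sent in later rounds.
            delta = clip(v[i] - g_local[i], tau)
            g_local[i] = g_local[i] + delta
            messages.append(delta + sigma * rng.standard_normal(x.shape))
        g = g + np.mean(messages, axis=0)           # server aggregates the clipped messages
    return x

# Heterogeneous toy problem: client i minimizes 0.5 * ||x - a_i||^2, so its
# gradient is x - a_i and the average objective is minimized at the mean of
# the targets, here (0, 1). Every client gradient has norm well above tau.
targets = [np.array([3.0, 0.0]), np.array([-3.0, 0.0]), np.array([0.0, 3.0])]
client_grads = [lambda x, a=a: x - a for a in targets]
x_final = run_sketch(client_grads, x0=[0.0, 0.0])
print(x_final)
```

Clipping the difference `v[i] - g_local[i]` rather than the gradient itself is the design choice that matters here: each message has norm at most `tau`, which is the kind of bound a Gaussian mechanism needs, yet the client estimates can still catch up with gradients much larger than `tau`, so heterogeneous clients do not bias the fixed point.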

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. A comprehensive analysis is provided. The algorithm combines several mechanisms, including gradient clipping, error feedback, and momentum, under arbitrary client heterogeneity, and convergence is proven without boundedness assumptions, which is somewhat novel. 2. Experiments are provided to show the validity and superiority of the proposed algorithm.

Weaknesses

1. The main weakness, as acknowledged in the paper, is that the algorithm cannot benefit from privacy amplification by subsampling. This makes the privacy-utility trade-off considerably worse. 2. The gradient distribution is assumed to be sub-Gaussian, which is still a relatively restrictive assumption. 3. The experiments are limited in scale and in the variants considered.

Reviewer 02·Rating 2·Confidence 4

Strengths

The manuscript provides an extensive theoretical analysis for smooth non-convex distributed objectives under arbitrary data heterogeneity, and Clip21-SGD2M achieves the optimal O(1/T) rate in the full-batch regime while maintaining a competitive local DP guarantee.

Weaknesses

1. Several prior works have explored the application of Error Feedback 21 (EF21) in the DP setup. For example, I noted that the structure proposed in this paper is almost identical to that shown in "Smoothed Normalization for Efficient Distributed Private Optimization", while the latter paper seems to include some further improvement through normalization. Given the different assumptions on the gradient (this paper makes the stronger sub-Gaussian assumption), I cannot easily compare the results and

Reviewer 03·Rating 8·Confidence 3

Strengths

This is a high-quality, technically dense paper that identifies a clear failure mode in existing methods, proposes a novel and well-motivated algorithm to fix it, and provides good theoretical and empirical validation. Theorem 2.2 proves that a recent and related algorithm, Clip21-SGD, fails in the stochastic setting. The proposed algorithm combines three existing ideas to achieve optimal convergence rates in the challenging but realistic setting of arbitrary data heterogeneity and stochastic gra

Weaknesses

The primary weakness of the paper is that the entire theoretical analysis assumes a full-participation model, where all clients participate in every round. This weakness is transparently acknowledged by the authors. The experiment in Fig. 5 partially alleviates this by providing at least some preliminary empirical support for the effectiveness of the algorithm under subsampling. Also, the experiments with DP were run only on MNIST; even rerunning these experiments on CIFAR would strengthen the

Taxonomy

Topics: Advanced Bandit Algorithms Research

Methods: Logistic Regression