Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy
Rustem Islamov, Samuel Horvath, Aurelien Lucchi, Peter Richtarik, Eduard Gorbunov

TL;DR
This paper introduces Clip21-SGD2M, a federated learning algorithm that combines clipping, heavy-ball momentum, and error feedback to obtain an optimal convergence rate and a near-optimal local differential privacy guarantee for non-convex smooth problems with arbitrarily heterogeneous client data.
Contribution
The paper proposes Clip21-SGD2M, presented as the first method to combine strong optimization guarantees with near-optimal (local) differential privacy in federated learning without assuming bounded gradients or bounded data heterogeneity.
Findings
Outperforms baselines in non-convex logistic regression
Achieves near-optimal differential privacy guarantees
Demonstrates superior optimization performance in neural network training
Abstract
Strong Differential Privacy (DP) and optimization guarantees are two desirable properties for a method in Federated Learning (FL). However, existing algorithms do not achieve both properties at once: they either have optimal DP guarantees but rely on restrictive assumptions such as bounded gradients/bounded data heterogeneity, or they ensure strong optimization performance but lack DP guarantees. To address this gap in the literature, we propose and analyze a new method called Clip21-SGD2M based on a novel combination of clipping, heavy-ball momentum, and Error Feedback. In particular, for non-convex smooth distributed problems with clients having arbitrarily heterogeneous data, we prove that Clip21-SGD2M has an optimal convergence rate and also a near-optimal (local-)DP neighborhood. Our numerical experiments on non-convex logistic regression and training of neural networks highlight the…
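The abstract names three ingredients: clipping, momentum, and error feedback. The sketch below shows one plausible way they can interact on a toy problem; it is not the paper's Clip21-SGD2M pseudocode. Everything in it is an assumption made for illustration: the quadratic client losses, the function names `clip` and `run_sketch`, the single client-side momentum buffer, and the parameters `lr`, `beta`, `tau`, and `sigma`. The optional Gaussian noise is not calibrated to any privacy budget.

```python
import math
import random


def clip(v, tau):
    """Rescale v so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= tau:
        return list(v)
    return [x * tau / norm for x in v]


def run_sketch(targets, steps=2000, lr=0.05, beta=0.5, tau=1.0,
               sigma=0.0, seed=0):
    """Toy run where client i holds f_i(x) = 0.5 * ||x - targets[i]||^2.

    Hypothetical illustration only: each client keeps a momentum
    buffer m_i and an error-feedback state g_i, and sends the clipped
    difference clip(m_i - g_i), optionally with Gaussian noise added.
    """
    rng = random.Random(seed)
    n, d = len(targets), len(targets[0])
    x = [0.0] * d
    m = [[0.0] * d for _ in range(n)]  # client momentum buffers
    g = [[0.0] * d for _ in range(n)]  # client error-feedback states
    g_avg = [0.0] * d                  # server estimate of the mean of g_i

    for _ in range(steps):
        # Server step using the current aggregated estimate.
        x = [xj - lr * gj for xj, gj in zip(x, g_avg)]

        msg_sum = [0.0] * d
        for i in range(n):
            # Exact local gradient of the toy quadratic loss.
            grad = [xj - tj for xj, tj in zip(x, targets[i])]
            # Momentum: exponential moving average of local gradients.
            m[i] = [(1 - beta) * mj + beta * gr
                    for mj, gr in zip(m[i], grad)]
            # Error feedback: clip the gap between m_i and g_i, so the
            # part removed by clipping is retried in later rounds.
            delta = clip([mj - gj for mj, gj in zip(m[i], g[i])], tau)
            g[i] = [gj + dj for gj, dj in zip(g[i], delta)]
            # Message to the server; sigma = 0 disables the noise.
            msg = [dj + rng.gauss(0.0, sigma) for dj in delta]
            msg_sum = [s + mj for s, mj in zip(msg_sum, msg)]

        # Server accumulates the averaged messages.
        g_avg = [ga + s / n for ga, s in zip(g_avg, msg_sum)]
    return x


if __name__ == "__main__":
    # Heterogeneous clients: the minimizer of the average loss is the
    # mean of the targets, here (1.0, -1.0).
    print(run_sketch([[3.0, 0.0], [-1.0, 2.0], [1.0, -5.0]]))
```

In this toy setting the client gradients do not vanish at the shared minimizer, which is the situation where clipping raw gradients would bias the average; clipping the difference `m_i - g_i` instead lets the clipped quantity shrink toward zero as the states catch up.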
Peer Reviews
Decision·Submitted to ICLR 2026
1. A comprehensive analysis is provided. The algorithm combines several mechanisms, including gradient clipping, error feedback, and momentum, under arbitrary client heterogeneity, and convergence is proven without boundedness assumptions, which is somewhat novel. 2. Experiments are provided to show the validity and superiority of the proposed algorithm.
1. The main weakness, as acknowledged in the paper, is that the algorithm cannot benefit from privacy amplification by subsampling. This makes the privacy-utility trade-off considerably worse. 2. The gradient distribution is assumed to be sub-Gaussian, which is still a relatively restrictive assumption. 3. The experiments are limited in scale and in the variants considered.
The manuscript provides extensive theoretical analysis for smooth non-convex distributed objectives under arbitrary data heterogeneity, and Clip21-SGD2M achieves the optimal O(1/T) rate in the full-batch regime while maintaining a competitive local DP guarantee.
1. There are several prior works that have explored the application of error feedback (EF21) in the DP setup. For example, I noted that the structure proposed in this paper is almost identical to that shown in "Smoothed Normalization for Efficient Distributed Private Optimization", while the latter paper seems to include some further improvement through normalization. Given the different assumptions on the gradient (this paper makes the stronger sub-Gaussian assumption), I cannot easily compare the results and
This is a high-quality, technically dense paper that identifies a clear failure mode in existing methods, proposes a novel and well-motivated algorithm to fix it, and provides good theoretical and empirical validation. Theorem 2.2 proves that a recent and related algorithm, Clip21-SGD, fails in the stochastic setting. The proposed algorithm combines three existing ideas to achieve optimal convergence rates in the challenging but realistic setting of arbitrary data heterogeneity and stochastic gra
The primary weakness of the paper is that the entire theoretical analysis assumes a full-participation model, where all clients participate in every round. This weakness is transparently acknowledged by the authors. The experiment in Fig. 5 partially alleviates this by providing at least some preliminary empirical support for the effectiveness of the algorithm under subsampling. Also, the experiments with DP were run only on MNIST; even rerunning these experiments on CIFAR would strengthen the
Taxonomy
Topics: Advanced Bandit Algorithms Research
Methods: Logistic Regression
