GradSkip: Communication-Accelerated Local Gradient Methods with Better Computational Complexity
Artavazd Maranjyan, Mher Safaryan, Peter Richtárik

TL;DR
GradSkip is a distributed optimization method that lets clients with less important data take fewer local gradient steps, reducing local computation while keeping ProxSkip's linear convergence and accelerated communication complexity.
Contribution
The paper introduces GradSkip, a variant of ProxSkip that adapts the number of local steps per client without worsening the communication complexity, and generalizes it to GradSkip+ and VR-GradSkip+.
Findings
Converges linearly under standard assumptions.
Maintains accelerated communication complexity.
Reduces local steps for less important data.
Abstract
We study a class of distributed optimization algorithms that aim to alleviate high communication costs by allowing clients to perform multiple local gradient-type training steps before communication. In a recent breakthrough, Mishchenko et al. (2022) proved that local training, when properly executed, leads to provable communication acceleration, and this holds in the strongly convex regime without relying on any data similarity assumptions. However, their ProxSkip method requires all clients to take the same number of local training steps in each communication round. We propose a redesign of the ProxSkip method, allowing clients with "less important" data to get away with fewer local training steps without impacting the overall communication complexity of the method. In particular, we prove that our modified method, GradSkip, converges linearly under the same assumptions and has the…
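As a rough illustration of the local-training scheme the abstract refers to, here is a minimal Python sketch of the ProxSkip baseline in its federated (Scaffnew) form, not of GradSkip itself. The scalar quadratic losses f_i(x) = 0.5 * a_i * (x - b_i)^2, the function name `scaffnew`, and all parameter values are assumptions chosen for illustration; none of them come from the paper's experiments.

```python
import random


def scaffnew(a, b, gamma, p, num_iters, seed=0):
    """Scaffnew-style baseline on hypothetical losses f_i(x) = 0.5*a[i]*(x - b[i])**2.

    a, b      : per-client curvature and minimizer (illustrative toy data)
    gamma     : step size, assumed to satisfy gamma <= 1 / max(a)
    p         : probability of communicating in a given iteration
    Returns the final local models and the number of communication rounds.
    """
    rng = random.Random(seed)
    n = len(a)
    x = [0.0] * n  # local models, one per client
    h = [0.0] * n  # control variates; their sum stays zero
    comms = 0
    for _ in range(num_iters):
        # Local gradient step, shifted by the client's control variate.
        x_hat = [x[i] - gamma * (a[i] * (x[i] - b[i]) - h[i]) for i in range(n)]
        if rng.random() < p:
            # Communication: all clients synchronize to the average.
            avg = sum(x_hat) / n
            x_new = [avg] * n
            comms += 1
        else:
            # Skipped communication: every client keeps training locally.
            x_new = x_hat
        # Control variates change only when a communication happened.
        h = [h[i] + (p / gamma) * (x_new[i] - x_hat[i]) for i in range(n)]
        x = x_new
    return x, comms


if __name__ == "__main__":
    a = [1.0, 2.0, 10.0]   # local smoothness constants (toy values)
    b = [3.0, -1.0, 0.5]
    kappa = max(a) / min(a)
    x, comms = scaffnew(a, b, gamma=1.0 / max(a), p=kappa ** -0.5, num_iters=2000)
    x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
    print(x, x_star, comms)
```

In this sketch every client performs a gradient step in every iteration, so all clients take the same number of local steps between communications. That uniformity is the restriction the abstract says GradSkip relaxes.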
Peer Reviews
Decision: Submitted to ICLR 2024
1. The key novelty lies in the newly introduced client-wise randomness, which induces fake local steps and fewer local steps (Lemmas 3.1 and 3.2); the idea is elegant. 2. Better computational complexity.
1. Compared to ProxSkip (Mishchenko et al. (2022)), the algorithm here requires finer structure information from the devices, i.e., individualized function smoothness parameters, while ProxSkip only requires a global smoothness parameter. And all clients are required to coordinate in advance to know the global information $\kappa_{\max}$, which may be a bit unrealistic. 2. According to Theorem 3.6, the client gradient query number is improved from $\sqrt{\kappa_{\max}}$ to $\min(\kappa_i, \sqrt{
1. The proposed GradSkip method and its extensions modify Scaffnew by allowing clients to skip local gradient computations, improving the local gradient computation complexity from $O(\sqrt{\kappa_{\max}}\log(1/\epsilon))$ to $O(\min(\sqrt{\kappa_{\max}},\kappa_i)\log(1/\epsilon))$ while still achieving the optimal communication complexity $\sqrt{\kappa}\log(1/\epsilon)$. I suggest the authors summarize their results and existing work in a table. 2. Allowing skipping gradient computation is helpful to a
1. The novelty of this paper looks somewhat limited. The main contribution is that GradSkip does not always compute the local gradient and thus requires $O(\min(\sqrt{\kappa_{\max}},\kappa_i)\log(1/\epsilon))$ local gradient computations instead of $O(\sqrt{\kappa_{\max}}\log(1/\epsilon))$. However, the framework and analysis of the proposed GradSkip are similar to those of Scaffnew. 2. The improvement in computational cost heavily depends on the values of $q_i$, which rely on $\kappa_i$. However, Remark 3.3 sa
1. This work proposes a new local gradient-type method for distributed optimization under communication and computation constraints, as an extension of the ProxSkip method. The proposed method inherits the accelerated communication complexity of ProxSkip while further improving computational complexity. 2. Two variants of the proposed method, GradSkip+ and VR-GradSkip+, are also proposed.
1. The assumption that the functions $f_i(x)$ are strongly convex is too strong, since many functions do not satisfy it when neural networks are used. 2. Lack of theoretical analysis of the communication complexity of the proposed method. In distributed optimization, communication complexity is crucial for minimizing inter-node communication to enhance system efficiency and reduce communication costs. 3. The experimental results are limited; the authors should conduct more experimen
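To make the per-client bound quoted in the reviews concrete, the short Python sketch below evaluates $\min(\sqrt{\kappa_{\max}}, \kappa_i)$ for a few clients, ignoring constants and the $\log(1/\epsilon)$ factor. The function name `local_step_bounds` and the condition numbers are hypothetical and serve only as a worked example.

```python
import math


def local_step_bounds(kappas):
    """Return min(sqrt(kappa_max), kappa_i) for each client.

    kappas: hypothetical local condition numbers kappa_i (each >= 1).
    Constants and the log(1/epsilon) factor are deliberately dropped.
    """
    root_kappa_max = math.sqrt(max(kappas))
    return [min(root_kappa_max, k) for k in kappas]


if __name__ == "__main__":
    # Three illustrative clients: two well-conditioned, one ill-conditioned.
    print(local_step_bounds([1.0, 4.0, 100.0]))
```

For these toy values, only the ill-conditioned client is charged the full $\sqrt{\kappa_{\max}}$ term. The two well-conditioned clients are bounded by their own $\kappa_i$, which is the computational saving the reviewers describe.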
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data
