Markovian Compression: Looking to the Past Helps Accelerate the Future
Andrey Veprikov, Vladimir Solodkin, Mikhail Rudakov, Petr Babkin, Aleksandr Beznosikov

TL;DR
This paper introduces Markovian compressors for distributed optimization, improving convergence and efficiency in communication-constrained settings by leveraging past information, with theoretical guarantees and empirical validation.
Contribution
It proposes a novel family of Markovian compressors integrated into QSGD and momentum QSGD, with convergence analysis for non-convex, Polyak-Lojasiewicz, and strongly convex objectives.
Findings
Markovian compressors outperform existing schemes in experiments.
The algorithms are shown to converge in the non-convex, Polyak-Lojasiewicz, and strongly convex cases.
Practical improvements are observed on the CIFAR-10 and GLUE datasets.
Abstract
This paper deals with distributed optimization problems that use compressed communication to achieve efficient performance and mitigate the communication bottleneck. We propose a family of compression schemes in which operators transform vectors fed to their input according to a Markov chain, i.e., the stochasticity of the compressors depends on previous iterations. The compressors are implemented in the vanilla Quantized Stochastic Gradient Descent algorithm (QSGD) and, to further improve efficiency and convergence rate, in the momentum-accelerated QSGD. We provide convergence results for our algorithms with Markovian compressors; the analysis covers the non-convex, Polyak-Lojasiewicz, and strongly convex cases. To demonstrate the applicability of our approach to distributed data-parallel optimization problems, we conduct experiments on the CIFAR-10 and GLUE datasets with the ResNet-18 and…
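The abstract does not spell out the operators themselves, so the following is only a minimal sketch of how a compressor's randomness can depend on previous iterations. It assumes a Rand-m style sparsifier that excludes coordinates it selected in its last few calls; the class name `BanRecentSparsifier`, the parameter `ban_rounds`, the `dim / m` rescaling, and the toy quadratic objective in `compressed_sgd` are all hypothetical choices made for illustration, not the paper's definitions.

```python
from collections import deque

import numpy as np


class BanRecentSparsifier:
    """Rand-m style sparsifier whose draw depends on its own recent draws.

    Hypothetical illustration: coordinates picked in the last `ban_rounds`
    calls are excluded from the next draw, so the sequence of index sets
    forms a Markov chain instead of i.i.d. samples.
    """

    def __init__(self, dim, m, ban_rounds, seed=0):
        if m * (ban_rounds + 1) > dim:
            raise ValueError("need m * (ban_rounds + 1) <= dim")
        self.dim, self.m = dim, m
        self.history = deque(maxlen=ban_rounds)  # index sets of recent calls
        self.rng = np.random.default_rng(seed)

    def __call__(self, x):
        banned = np.zeros(self.dim, dtype=bool)
        for idx in self.history:
            banned[idx] = True
        allowed = np.flatnonzero(~banned)
        idx = self.rng.choice(allowed, size=self.m, replace=False)
        self.history.append(idx)
        out = np.zeros_like(x)
        # Assumed rescaling, borrowed from plain Rand-m sparsification.
        out[idx] = x[idx] * (self.dim / self.m)
        return out


def compressed_sgd(compressors, targets, steps=2000, lr=0.05):
    """Toy distributed loop: worker i holds f_i(x) = 0.5 * ||x - targets[i]||^2.

    Each worker sends a compressed gradient; the server averages the messages.
    """
    x = np.zeros(targets.shape[1])
    for _ in range(steps):
        msgs = [C(x - b) for C, b in zip(compressors, targets)]
        x -= lr * np.mean(msgs, axis=0)
    return x
```

Each worker would hold its own compressor instance (for example `BanRecentSparsifier(dim=20, m=5, ban_rounds=2, seed=i)`), since the state that makes the operator Markovian is the per-worker history of selected coordinates. Setting `ban_rounds=0` recovers memoryless random sparsification in this sketch.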
Peer Reviews
Decision · Submitted to ICLR 2025
I consider Markovian compressors to be a novel idea, but it is possible that I have missed some related literature. The proposed algorithm is easy to implement and fairly understandable, which might broaden its interest to general readers. The math looks correct to me, though I have not checked every detail.
My main concern is that the theoretical analysis does not seem to touch the real advantage of using a Markovian compressor. It appears that the paper resorted to stationarity analysis, but the stationary distribution of the Markov chain simply falls back to the standard random sparsification. It is thus unclear what is gained by a Markovian compressor. Although a better analysis could be challenging, as argued in the paper, the significance of the proposed algorithm would be greatly strengthened
Let me start by stating that I am not *at all* an expert in optimisation, even less in the distributed setting. I will therefore not judge the novelty of the paper as I have no clue. Given this, I have found the paper interesting. The proposed idea is very natural and quite simple. The theoretical guarantees are strong and provide a clear view on convergence properties (I think). The numerics are convincing.
I was disappointed by the quality of presentation. E.g., the text is sometimes quite cryptic and there are many linking words missing. Some definitions are missing too, and I had to guess a number of times what the authors meant. I recommend making an effort in presentation. Although I found the problem interesting in itself, it is not clear to me that ICLR is the right place to publish this paper. But the AC and authors probably know more about this than I do; this is not a strong view.
The paper presents convergence bounds for the proposed compressors and distributed optimization approaches. These bounds can be valuable for analyzing similar optimization approaches.
* The presentation of the proofs needs to be improved. In its current form, the proof is difficult to follow. It is suggested that, for each inequality, the authors indicate which fact(s) were used to obtain it. It may also help if the authors provide more proof sketches.
* It seems that the convergence bounds hold for any unbiased or asymptotically unbiased compressors. How, then, do the authors justify that BanLast and KAWASAKI are near-optimal among all
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Error Correcting Code Techniques
