Controlled disagreement improves generalization in decentralized training

Zesen Wang; Mikael Johansson

arXiv:2602.02899·cs.LG·February 4, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces DSGD-AC, a decentralized training method that intentionally maintains consensus errors, which act as structured perturbations guiding the model toward flatter minima and improving generalization.

Contribution

It presents a novel decentralized SGD algorithm with adaptive consensus that leverages consensus errors as beneficial regularizers, challenging traditional views on decentralized training.

Findings

1. DSGD-AC outperforms standard DSGD and centralized SGD in test accuracy.
2. Consensus errors align with the dominant Hessian subspace.
3. Structured consensus errors guide optimization toward flatter minima.
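As a rough illustration of what "alignment with the dominant Hessian subspace" could mean numerically, the sketch below computes each worker's deviation from the worker average and the fraction of that deviation's squared norm that lies in the span of the top-k Hessian eigenvectors. The function names and the choice of metric are assumptions made for illustration; this page does not say how the paper measures alignment.

```python
import numpy as np

def consensus_errors(params):
    """Deviation of each worker's parameters from the worker average.

    params: array of shape (n_workers, dim), one parameter vector per row.
    """
    mean = params.mean(axis=0, keepdims=True)
    return params - mean

def top_subspace_energy(errors, hessian, k):
    """Fraction of the squared error norm inside the span of the k Hessian
    eigenvectors with the largest eigenvalues (hypothetical alignment metric).

    Assumes a symmetric Hessian and at least one nonzero error vector.
    """
    _, eigvecs = np.linalg.eigh(hessian)   # eigenvalues in ascending order
    top = eigvecs[:, -k:]                  # (dim, k) dominant directions
    projected = errors @ top               # coordinates in the top subspace
    return float((projected ** 2).sum() / (errors ** 2).sum())

# Toy quadratic: curvature 10 along axis 0, much flatter along axes 1 and 2.
H = np.diag([10.0, 1.0, 0.1])
workers = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
# The workers disagree only along the sharpest axis, so the fraction is 1.
print(top_subspace_energy(consensus_errors(workers), H, k=1))
```

A value near 1 would indicate that the disagreement is concentrated in high-curvature directions; a value near k/dim is what isotropic random noise would give on average.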

Abstract

Decentralized training is often regarded as inferior to centralized training because the consensus errors between workers are thought to undermine convergence and generalization, even with homogeneous data distributions. This work challenges this view by introducing decentralized SGD with Adaptive Consensus (DSGD-AC), which intentionally preserves non-vanishing consensus errors through a time-dependent scaling mechanism. We prove that these errors are not random noise but systematically align with the dominant Hessian subspace, acting as structured perturbations that guide optimization toward flatter minima. Across image classification and machine translation benchmarks, DSGD-AC consistently surpasses both standard DSGD and centralized SGD in test accuracy and solution flatness. Together, these results establish consensus errors as a useful implicit regularizer and open a new…
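The "time-dependent scaling mechanism" is quoted in Reviewer 01's comments as $\gamma^{(t)} = [\alpha^{(t)}/\alpha_{\max}]^p$, where $\alpha^{(t)}$ is the learning rate at step $t$. The following minimal sketch only evaluates that formula; the function and argument names are hypothetical, and how $\gamma^{(t)}$ enters the DSGD-AC update is not stated on this page, so it is not modeled here.

```python
def adaptive_consensus_scale(lr, lr_max, p):
    """Evaluate gamma^(t) = (alpha^(t) / alpha_max) ** p.

    lr     -- current learning rate alpha^(t)
    lr_max -- peak learning rate alpha_max (must be positive)
    p      -- exponent; its recommended value is not given on this page
    """
    if lr_max <= 0:
        raise ValueError("lr_max must be positive")
    return (lr / lr_max) ** p

# With p > 0 the factor is 1 at the peak learning rate and shrinks as the
# learning rate decays; with p = 0 it stays at 1 throughout.
for lr in (0.1, 0.05, 0.01):
    print(lr, adaptive_consensus_scale(lr, lr_max=0.1, p=2))
```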

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper is easy to follow. The algorithm presented in Algorithm 1 is precise. The connection between the consensus error and the Hessian subspace is conceptually interesting.

Weaknesses

1. **Overly strong assumptions.** The theoretical analysis relies on the assumption of *data homogeneity*: that all agents share identical local objectives, $f_i = f_j$. This assumption weakens the theoretical contribution.
2. **Lack of theoretical analysis for the adaptive scalar.** The adaptive scaling factor is taken as $\gamma^{(t)} = [\alpha^{(t)}/\alpha_{\max}]^p$. The theoretical analysis in this paper is not strong enough. The paper contains two propositions in the main secti

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The method is interesting and well motivated, though it is a bit simple. The theory gives an insightful message for designing algorithms, though it is (still) a bit simple.

Weaknesses

My first concern is that the theory has limited novelty and merit given an existing paper [Zhu et al., 2023]. That paper calculated the consensus error and proved the asymptotic equivalence between D-SGD and SAM. The new part in this paper is injecting a lambda to make the consensus error non-diminishing. Further, the experimental results are not strong enough. The authors did not compare different topologies or values of the hyperparameter alpha, two key tunable factors in this paper. The perform

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* The core observation is novel. Building on the work of Zhu et al. (2023), the idea that the implicit quadratic loss from the gossip phase should have its own adaptive scaling factor (decoupled from the main learning rate) is an insightful contribution.
* The paper provides a theoretical justification for the algorithm's design. Proposition 1 formally demonstrates that DSGD-AC, with the proposed scaling, can maintain non-vanishing consensus errors, which is the mechanism intended to preserve th

Weaknesses

* The paper's primary weakness lies in its experimental validation. The experiments are restricted to a very specific and limited setting: 8 nodes in a one-peer ring topology. This setup does not provide sufficient evidence for the method's effectiveness or scalability in more general or larger-scale scenarios (e.g., more nodes, different graph topologies).
* The performance improvement on the machine translation task is very small. In Table 2, the BLEU score for DAdam-AC is only marginally bett

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Face and Expression Recognition