Achieving Dimension-Free Communication in Federated Learning via Zeroth-Order Optimization

Zhe Li; Bicheng Ying; Zidong Liu; Chaosheng Dong; Haibo Yang

arXiv:2405.15861·cs.LG·June 4, 2025

TL;DR

This paper introduces DeComFL, a dimension-free federated learning algorithm using zeroth-order optimization, drastically reducing communication costs regardless of model size, with theoretical guarantees and practical validation on large models.

Contribution

DeComFL is the first dimension-free communication algorithm for federated learning leveraging zeroth-order methods, achieving constant communication per round and theoretical convergence guarantees.

Findings

1. Reduces communication from O(d) to O(1) per round.
2. Achieves linear speedup with the number of clients and local steps.
3. Demonstrates significant practical reductions in communication overhead.
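To make the O(d) versus O(1) contrast concrete, here is a rough back-of-the-envelope sketch of per-round uplink payload for one client. The function names, the 4-byte float32 encoding, and the single-scalar default are illustrative assumptions, not figures from the paper; the parameter count is only an approximation of an OPT-1.3B-sized model.

```python
def bytes_full_model(d: int, bytes_per_param: int = 4) -> int:
    """Payload when a client uploads all d parameters: grows linearly in d."""
    return d * bytes_per_param


def bytes_scalars(num_scalars: int = 1, bytes_per_scalar: int = 4) -> int:
    """Payload when a client uploads a fixed number of scalars: independent of d."""
    return num_scalars * bytes_per_scalar


# Assumption: roughly 1.3 billion parameters, stored as float32.
d = 1_300_000_000

full = bytes_full_model(d)
scalar = bytes_scalars()

print(f"full model upload: {full / 1e9:.1f} GB")
print(f"scalar upload:     {scalar} bytes")
print(f"ratio:             {full // scalar:,}x")
```

This counts a single upload only. The total saving over a training run also depends on how many rounds each method needs to converge, which this arithmetic does not capture.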

Abstract

Federated Learning (FL) offers a promising framework for collaborative and privacy-preserving machine learning across distributed data sources. However, the substantial communication costs associated with FL significantly challenge its efficiency. Specifically, in each communication round, the communication costs scale linearly with the model's dimension, which presents a formidable obstacle, especially in large model scenarios. Despite various communication-efficient strategies, the intrinsic dimension-dependent communication cost remains a major bottleneck for current FL implementations. This paper proposes a novel dimension-free communication algorithm, DeComFL, which leverages zeroth-order optimization techniques and reduces the communication cost from $O(d)$ to $O(1)$ by transmitting only a constant number of scalar values between clients and the server in…
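The idea of sending scalars instead of vectors can be illustrated with a minimal sketch. It assumes a two-point zeroth-order estimator and a random direction that both sides regenerate from a shared integer seed; it is not the DeComFL algorithm itself. The helper names (`perturbation`, `client_scalar`, `apply_update`), the toy quadratic loss, and the step sizes are all hypothetical, and the sketch assumes client and server run the same NumPy generator implementation.

```python
import numpy as np


def perturbation(seed: int, d: int) -> np.ndarray:
    """Regenerate the same d-dimensional Gaussian direction from a seed."""
    return np.random.default_rng(seed).standard_normal(d)


def client_scalar(loss, x: np.ndarray, seed: int, mu: float = 1e-3) -> float:
    """Two-point zeroth-order estimate of the directional derivative along z(seed)."""
    z = perturbation(seed, x.size)
    return float((loss(x + mu * z) - loss(x)) / mu)


def apply_update(x: np.ndarray, seed: int, g: float, lr: float) -> np.ndarray:
    """Rebuild the full d-dimensional step from just (seed, g)."""
    return x - lr * g * perturbation(seed, x.size)


def loss(x: np.ndarray) -> float:
    """Toy quadratic objective, used only for illustration."""
    return 0.5 * float(x @ x)


x_client = np.ones(10)
x_server = x_client.copy()
seed = 0

# The client computes one scalar; only (seed, g) would cross the network.
g = client_scalar(loss, x_client, seed)

# Both sides apply the identical update without exchanging a vector.
x_client = apply_update(x_client, seed, g, lr=0.01)
x_server = apply_update(x_server, seed, g, lr=0.01)

print(np.array_equal(x_client, x_server))
```

The payload in this sketch is one float and one integer regardless of the model dimension, at the cost of each party regenerating the direction locally.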

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The paper is generally well-written and has a good flow.
2. The convergence analysis is necessary and duly provided. The discussion on the effective rank assumption to improve the pessimistic convergence bound is interesting. I did not check through the details for the correctness of the proof.
3. The algorithm design is sound.

Weaknesses

1. I am not convinced about the critical role of zeroth-order optimization in the problem setting to reduce communication costs.
2. Parts about the related works and the experiments could be improved, as detailed below in the Questions.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The problem tackled is interesting and important, and the proposed method saves a lot of communication (on the order of 1000s in experiments). The theoretical analysis makes it possible to reason about potential communication savings over the full course of training. Experiments are done on large models (up to OPT-1.3B).

Weaknesses

1. The paper does not state how exactly the random seeds are chosen, which might affect the distribution of the generated sequence. As far as I know, random generators guarantee the distribution of a sequence of numbers sampled from a single generator initialized once with some random seed. However, when each number has its own random generator with its own random seed, I am not sure what guarantees exist, and I imagine it depends on the distributions of the random seeds and particular implemen

Reviewer 03 · Rating 6 · Confidence 4

Strengths

It is quite novel to see the use of a zeroth-order method for federated learning, and this paper makes a valuable contribution to this area. With small and clever modifications to the previous algorithm by Fang et al. (2022), this research effectively reduces the per-iteration communication costs to a constant for each agent. Supported by both theoretical and experimental evidence, this new method significantly outperforms FedAvg in terms of communication costs.

Weaknesses

The assumption made in Theorem 2 is not very standard. I am not sure if $\kappa$ can truly be seen as an $O(1)$ constant, independent of $d$. What is the consequence if $\kappa$ scales with $d$, even if it is not $\Theta(d)$?

Minor:
1. I think the algorithm was stated for $P=1$. When reading pages 4 and 5, $P$ does not appear to be any part of the algorithm. It was confusing what role the constant $P$ plays in the algorithm.
2. In assumption 4, the second maximum should be over

Code & Models

Repositories

ZidongLiu/DeComFL
PyTorch · Official

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Cooperative Communication and Network Coding