Non-Convex Federated Optimization under Cost-Aware Client Selection
Xiaowen Jiang, Anton Rodomanov, Sebastian U. Stich

TL;DR
This paper introduces a cost-aware federated optimization model that accounts for communication and local computation costs, and proposes a new algorithm with the best-known complexities for non-convex problems.
Contribution
It develops a novel federated optimization framework that explicitly models client selection costs and presents an algorithm with state-of-the-art communication and local computation efficiency.
Findings
Proposes a cost-aware federated optimization model.
Introduces RG-SAGA, an improved gradient estimator.
Achieves best-known complexities for non-convex federated optimization.
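For readers unfamiliar with SAGA-type estimators, the following is a minimal sketch of the classic minibatch SAGA gradient estimator in a client-sampling setting. It is not the paper's RG-SAGA, whose details are not given in this summary. The least-squares client objectives and the names `make_quadratic_clients`, `client_grad`, and `SagaEstimator` are assumptions made purely for illustration.

```python
import numpy as np


def make_quadratic_clients(n_clients, dim, seed=0):
    """Hypothetical toy data: client i holds f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n_clients, dim))
    b = rng.normal(size=n_clients)
    return A, b


def client_grad(A, b, i, x):
    """Gradient of the toy objective f_i at x."""
    return (A[i] @ x - b[i]) * A[i]


class SagaEstimator:
    """Classic SAGA: keep the last gradient seen from every client."""

    def __init__(self, A, b, x0):
        self.A, self.b = A, b
        n = A.shape[0]
        # One full pass over all clients to initialise the gradient table.
        self.table = np.stack([client_grad(A, b, i, x0) for i in range(n)])
        self.table_mean = self.table.mean(axis=0)

    def estimate(self, x, sampled):
        """Return a gradient estimate at x using only the sampled clients.

        `sampled` must contain distinct client indices (sampling without
        replacement), otherwise the table update below is incorrect.
        """
        n = self.table.shape[0]
        fresh = np.stack([client_grad(self.A, self.b, i, x) for i in sampled])
        old = self.table[sampled]  # fancy indexing returns a copy
        # Unbiased estimate: stored average plus the sampled correction.
        g = self.table_mean + (fresh - old).mean(axis=0)
        # Refresh the stored gradients and their running mean.
        self.table_mean = self.table_mean + (fresh - old).sum(axis=0) / n
        self.table[sampled] = fresh
        return g
```

A typical call would draw `sampled = rng.choice(n, size=s, replace=False)` at each round, so that only `s` clients communicate. When every client is sampled, the estimate coincides with the full gradient.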
Abstract
Different federated optimization algorithms typically employ distinct client-selection strategies: some methods communicate only with a randomly sampled subset of clients at each round, while others need to periodically communicate with all clients or use a hybrid scheme that combines both strategies. However, existing metrics for comparing optimization methods typically do not distinguish between these strategies, which often incur different communication costs in practice. To address this disparity, we introduce a simple and natural model of federated optimization that quantifies communication and local computation complexities. This new model allows for several commonly used client-selection strategies and explicitly associates each with a distinct cost. Within this setting, we propose a new algorithm that achieves the best-known communication and local complexities among existing…
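To make the motivation concrete, here is a rough sketch of how a round count can hide differences that a cost-aware count exposes. The two round types, the `RoundCosts` fields, and the numeric costs are hypothetical placeholders chosen for illustration; they are not the cost model defined in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundCosts:
    # Hypothetical per-round prices; the paper's actual costs are not shown here.
    full_sync: float  # one round that contacts all clients
    sampled: float    # one round that contacts a random subset


def periodic_schedule(num_rounds, period):
    """Hybrid scheme: a full synchronisation every `period` rounds."""
    return ["full" if t % period == 0 else "sampled" for t in range(num_rounds)]


def total_communication_cost(schedule, costs):
    """Sum the price of each round according to its selection strategy."""
    price = {"full": costs.full_sync, "sampled": costs.sampled}
    return sum(price[kind] for kind in schedule)


costs = RoundCosts(full_sync=10.0, sampled=1.0)
hybrid = periodic_schedule(num_rounds=6, period=3)
sampled_only = ["sampled"] * 6

# Both schedules use 6 rounds, but their cost-aware totals differ.
hybrid_cost = total_communication_cost(hybrid, costs)
sampled_cost = total_communication_cost(sampled_only, costs)
```

Under these assumed prices, two methods with the same number of rounds end up with different totals, which is the kind of distinction a plain round count cannot express.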
Peer Reviews
Decision·ICLR 2026 Oral
- The proofs incorporating the SAGA estimator are quite nice. The notion of variance the authors define in the line right under (E.9) differs from the notion of variance used in the SVRG estimator (which is also used by prior work). This is quite novel, and I believe it is a very useful insight.
- The paper explicitly includes details on how to solve the local problems. Even though the stopping criterion requires knowing many problem parameters, this is still appreciated as much prior work i
- It is quite surprising that the algorithm just chooses the first function at each timestep and calculates the prox with respect to it. This seems suboptimal (e.g., what if that first function is unusually dissimilar? The average similarity can be low and this one function could just be an outlier).
- The CIFAR-10 accuracy in the experiments section is far too low. 77%? A three-minute run with SGD can reach 96% (see https://github.com/KellerJordan/cifar10-airbench). I suggest the baselines should b
The work demonstrates that SAGA-type methods can exploit second-order similarity, which is a novel theoretical contribution. The work designs a method that does not require frequent full synchronization, relevant to practical applications. The paper provides a thorough theoretical analysis, including a new framework for comparing algorithm complexities. The theoretical claims are well-supported by numerical experiments on both synthetic and real-world datasets (LIBSVM, EMNIST, CIFAR-10). The pro
The authors mention scenarios where the second-order dissimilarity constant $\delta$ is smaller than $L_{\max}$. However, in many practical FL settings with highly distinct data distributions, client data can be dissimilar. The analysis does not extend to the stochastic setting, which limits its direct applicability to problems involving online learning at the client level. The proposed method CGM-RG-SAGA is a combination of well-known techniques such as Composite Gradient Method, SAGA variance
- **Originality.** Introduces an RG-SAGA/SVRG estimator composition inside CGM that explicitly leverages $\delta$-SOD; the variance recursion is sharpened by a factor $ns$ relative to classic SAGA analyses (Cor. 3.7), which is technically neat and nontrivial.
- **Quality.** Theory is stated with clear assumptions ($\delta$-SOD, $\Delta_1$-ED) and tracks the impact of estimator error and subproblem accuracy on iteration complexity (Theorem 3.8). The communication model (ACO/RCO/DCO) is formalize
- **Novelty gaps vs. prior variance-reduced FL.** The “recursive” flavor has strong parallels to PAGE/SARAH-type recursions used in SABER and related CGM variants; several ingredients (e.g., plugging VR gradient trackers into CGM, periodic/full-grad syncs) exist in recent work. The paper’s comparison table is helpful, but the “best known” claim would benefit from a tighter, side-by-side theorem-level comparison against SABER’s second-order-aware CGM under matching assumptions and oracles.
- **Ass
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques · Cryptography and Data Security
