Lean Clients, Full Accuracy: Hybrid Zeroth- and First-Order Split Federated Learning

Zhoubin Kou; Zihan Chen; Jing Yang; Cong Shen

arXiv:2601.09076·cs.LG·January 15, 2026

TL;DR

HERON-SFL introduces a hybrid zeroth- and first-order split federated learning framework that reduces client memory and computation costs while maintaining high accuracy, enabling resource-constrained devices to train larger models.

Contribution

It proposes a novel hybrid optimization method combining zeroth- and first-order techniques with theoretical convergence guarantees independent of model size.

Findings

01. Reduces client peak memory by up to 64%.

02. Lowers client-side compute cost by up to 33% per step.

03. Maintains benchmark accuracy on ResNet and language models.

Abstract

Split Federated Learning (SFL) enables collaborative training between resource-constrained edge devices and a compute-rich server. Communication overhead is a central issue in SFL and can be mitigated with auxiliary networks. Yet the fundamental client-side computation challenge remains: back-propagation incurs substantial memory and computation costs, severely limiting the scale of models that edge devices can support. To enable more resource-efficient client computation and reduce client-server communication, we propose HERON-SFL, a novel hybrid optimization framework that integrates zeroth-order (ZO) optimization for local client training while retaining first-order (FO) optimization on the server. With the assistance of auxiliary networks, ZO updates enable clients to approximate local gradients using perturbed forward-only evaluations per step, eliminating…
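To make the "perturbed forward-only evaluations" idea concrete, here is a minimal sketch of a generic two-point zeroth-order gradient estimate. It is not the HERON-SFL algorithm itself: the Gaussian perturbation directions, the smoothing radius `mu`, the toy `quadratic_loss`, and the function names `zo_gradient` and `zo_sgd` are all assumptions chosen for illustration, and no auxiliary network or client-server split is modeled.

```python
import numpy as np


def zo_gradient(loss_fn, params, mu=1e-3, num_perturbations=1, rng=None):
    """Two-point zeroth-order gradient estimate (illustrative, hypothetical helper).

    Only forward evaluations of loss_fn are used; no back-propagation.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(params)
    for _ in range(num_perturbations):
        # Random direction; a Gaussian direction is an assumption of this sketch.
        u = rng.standard_normal(params.shape)
        # Finite difference of the loss along +u and -u.
        delta = loss_fn(params + mu * u) - loss_fn(params - mu * u)
        grad += (delta / (2.0 * mu)) * u
    return grad / num_perturbations


def quadratic_loss(w):
    """Toy stand-in for a client's local loss."""
    return 0.5 * float(np.sum(w ** 2))


def zo_sgd(loss_fn, params, lr=0.05, steps=500, seed=0):
    """Plain SGD loop driven by the zeroth-order estimate above."""
    rng = np.random.default_rng(seed)
    w = params.copy()
    for _ in range(steps):
        w -= lr * zo_gradient(loss_fn, w, rng=rng)
    return w


if __name__ == "__main__":
    w0 = np.ones(10)
    w_final = zo_sgd(quadratic_loss, w0)
    print("initial loss:", quadratic_loss(w0))
    print("final loss:  ", quadratic_loss(w_final))
```

Each step costs two forward passes per perturbation and stores no activations for a backward pass, which is the general mechanism behind the memory savings that zeroth-order methods are credited with.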

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Using ZO optimization on the client side shows significant resource savings (up to 64% memory and 65% computation) with comparable accuracy, making it suitable for resource-constrained devices.

Weaknesses

1. The main contribution of HERON-SFL is an incremental update to Han et al.'s local-loss-based split learning. In Han et al., clients perform local updates using their own auxiliary model, eliminating global backprop. HERON-SFL simply swaps the FO updates on the client side with ZO updates while keeping the server-side optimization FO-based. 2. The core idea, that ZO optimization can replace FO optimization for memory reduction, is not new, as ZO optimization has been explored extensively before

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper targets a highly relevant challenge of how to scale federated training to resource-limited clients without sacrificing global model quality. The proposed method preserves split learning’s efficiency benefits (clients train partial models) while avoiding its synchronization and overheads. The experiments are broad.

Weaknesses

1. The idea is very similar to FedGKT [1], which is not included in the comparisons. There are many follow-up works to FedGKT, and the authors need to cover some of the recent ones in their comparisons. 2. The smashed activations can leak private data. It has to be experimentally demonstrated how robust the proposed scheme is to model inversion and other attacks. 3. While not directly FL, there is another recent work that seeks to train small models at clients with support from server by offloadin

Reviewer 03 · Rating 4 · Confidence 1

Strengths

1. The theoretical analysis covers iid and non-iid settings together and also considers the specific split + federated learning setting. This thorough analysis framework will provide readers with a useful starting point for analyzing other algorithms. 2. Client-side resource consumption is thoroughly analyzed (Section 4.2), which clearly shows the benefits of the proposed method. 3. Experiments are fairly extensive, covering training from scratch as well as fine-tuning.

Weaknesses

While I see mostly valuable contributions, I still find some limitations, as follows. 1. [**Relative performance gain against conventional FL and conventional SplitLearning**] While the theoretical analysis and empirical results demonstrate the efficacy of the proposed method, I am not sure whether it consumes fewer resources than conventional FL to achieve the same target accuracy. What if the zeroth-order method is applied to a few input-side layers and the first-order method to the rest of th

Reviewer 04 · Rating 6 · Confidence 3

Strengths

* The idea of reducing memory and computation costs through zeroth-order optimization on the client side, while compensating for it with precise first-order optimization on the server side, is convincing. * The paper provides evidence, both experimental and theoretical, that the proposed hybrid approach can effectively reduce memory and computation costs without causing performance degradation. * The paper is well written and easy to follow.

Weaknesses

**The importance of client-side training.** I believe that the main reason the proposed hybrid optimization does not lead to a notable performance drop is that most of the learning still happens on the server using first-order optimization, while the client side primarily serves to “smash” the data for privacy. Because of this, it is difficult to clearly separate whether the benefit comes from the effectiveness of zeroth-order optimization itself, or simply from the fact that training the clien

Reviewer 05 · Rating 2 · Confidence 5

Strengths

+ Introduces HERON-SFL, the hybrid zeroth-order (ZO) and first-order (FO) optimization framework for Split Federated Learning (SFL). This hybridization smartly leverages ZO for clients (forward-only computation) and FO for servers (precise gradient updates), balancing computational efficiency and accuracy. + Provides a rigorous convergence analysis for HERON-SFL. Shows that under a low effective-rank assumption, the convergence rate becomes independent of model dimensionality, overcoming a majo

Weaknesses

- The experiments focus on ResNet-18 (CIFAR-10) and GPT-2 (E2E dataset) — both relatively moderate-scale tasks. There is no evaluation on truly large foundation models (e.g., GPT-3-scale or ViT-level networks) where the claimed scalability advantages would be more convincing. - Client-device heterogeneity (e.g., varying compute or network speeds) is not experimentally explored, which is critical in federated settings. - Although communication costs are analyzed, real-world communication latency,

Taxonomy

Topics: Privacy-Preserving Technologies in Data · IoT and Edge/Fog Computing · Caching and Content Delivery