Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization
Zhe Li, Bicheng Ying, Zidong Liu, Chaosheng Dong, Haibo Yang

TL;DR
This paper introduces HiSo, a Hessian-informed zeroth-order (ZO) federated optimization method. By leveraging global Hessian approximations, HiSo accelerates convergence and reduces communication costs in federated learning, especially when fine-tuning large language models, without increasing communication overhead.
Contribution
The paper proposes HiSo, a novel Hessian-informed ZO federated optimization method that accelerates convergence using global Hessian approximations while maintaining scalar-only communication.
Findings
HiSo achieves 1-5x faster convergence in communication rounds compared to state-of-the-art ZO-FL methods.
Theoretically, HiSo's convergence rate is independent of the Lipschitz constant and model dimension under certain assumptions.
Empirical results demonstrate significant communication savings and faster fine-tuning of large language models.
Abstract
Zeroth-order (ZO) optimization enables dimension-free communication in federated learning (FL), making it attractive for fine-tuning of large language models (LLMs) due to significant communication savings. However, existing ZO-FL methods largely overlook curvature information, despite its well-established benefits for convergence acceleration. To address this, we propose HiSo, a Hessian-informed ZO federated optimization method that accelerates convergence by leveraging global diagonal Hessian approximations, while strictly preserving scalar-only communication without transmitting any second-order information. Theoretically, for non-convex functions, we show that HiSo can achieve an accelerated convergence rate that is independent of the Lipschitz constant and model dimension under some Hessian approximation assumptions, offering a plausible explanation for the observed…
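The mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that clients and server share a per-round random seed, that a fixed diagonal Hessian approximation `h_diag` is already available (the text above does not say how HiSo obtains it), and that each client takes a single local step. All function names (`perturbation`, `zo_scalar`, `run_rounds`) and the toy quadratic losses are hypothetical.

```python
import numpy as np


def perturbation(seed, dim, h_diag):
    """Regenerate the round's search direction from a shared seed.

    z ~ N(0, I) is rescaled by h_diag ** -0.5, so coordinates with high
    (assumed) curvature receive smaller perturbations.
    """
    z = np.random.default_rng(seed).standard_normal(dim)
    return z / np.sqrt(h_diag)


def zo_scalar(loss_fn, x, direction, mu=1e-3):
    """Two-point finite-difference estimate of the directional derivative."""
    return (loss_fn(x + mu * direction) - loss_fn(x - mu * direction)) / (2.0 * mu)


def run_rounds(client_losses, x0, h_diag, lr=0.1, rounds=200):
    """Each round, every client reports one scalar; no vectors are exchanged."""
    x = np.asarray(x0, dtype=float).copy()
    for t in range(rounds):
        # Every party rebuilds the same direction from the round index (the seed).
        u = perturbation(seed=t, dim=x.size, h_diag=h_diag)
        # Uplink: one float per client.
        scalars = [zo_scalar(f, x, u) for f in client_losses]
        # Downlink: the server broadcasts the averaged scalar.
        g = float(np.mean(scalars))
        # Local model update along the shared, curvature-scaled direction.
        x = x - lr * g * u
    return x


# Toy setup (an assumption for illustration): two clients with
# ill-conditioned quadratic losses centered at different targets.
curvature = np.array([100.0, 10.0, 1.0, 0.1])
targets = [0.1 * np.ones(4), -0.1 * np.ones(4)]
client_losses = [
    (lambda x, c=c: 0.5 * float(np.sum(curvature * (x - c) ** 2))) for c in targets
]


def global_loss(x):
    return float(np.mean([f(x) for f in client_losses]))


x0 = np.ones(4)
x_final = run_rounds(client_losses, x0, h_diag=curvature)
print(global_loss(x0), global_loss(x_final))
```

In this sketch the only values crossing the network per round are the client scalars and their average, regardless of the model dimension. The curvature information enters only through the locally applied rescaling of the shared direction.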
Peer Reviews
Decision: ICLR 2026 Poster
1- The improved convergence by HiSo is significant in comparison to other ZO-FL benchmarks, while keeping the communication costs significantly small. The 1-5x speedup over DeComFL is a practical and valuable improvement. 2- The paper provides a theoretical analysis to support the method, proving a convergence rate independent of model dimension `d` for non-convex functions. 3- The proposed method addresses a clear and important bottleneck in federated LLM fine-tuning, namely the high communic
1- The experimental validation is limited to the OPT family of models (OPT-125M to OPT-2.7B). These models are somewhat dated and are known to be undertrained. The effectiveness of HiSo on newer, more capable models (e.g., smaller variants of LLaMA-3.2, Qwen-2.5, Gemma 3, or SmolLM) is not demonstrated. It is unclear if the same optimization behavior will hold on these new architectures. 2- The claim that HiSo is a "Hessian-informed" or "second-order" method is potentially misleading. The propo
The paper is well-written, with a well-motivated research goal and a clear description of the algorithm.
I have the following concerns regarding the paper: Theoretical Practicality and Depth: The theoretical analysis relies on a good approximation of the Hessian, yet the method employed in practice is only a diagonal approximation. This gap makes it difficult to appreciate the practical relevance of Theorem 1. Furthermore, a simple non-convex analysis seems insufficient, as it fails to capture the specific landscape properties of neural network loss functions. Experimental Comprehensiveness: The
1. Theoretical contribution: The paper proves non-convex convergence bounds where the rate depends on a whitening rank related to the effective Hessian spectrum instead of the raw model dimension, and extends DeComFL-style theory to multiple local steps per round. 2. Strong and concrete motivation: Existing scalar-only ZO methods solve bandwidth but converge painfully slowly. HiSo squarely targets this convergence bottleneck without giving up the scalar-only advantage. 3. The paper repeatedl
1. Benchmark scale and diversity: The main LLM experiments involve six clients with two sampled per round, and tasks are classification and extractive QA. Although these are standard NLP benchmarks and good stress tests for convergence and accuracy, they are still small compared to industrial federated networks across hospitals, phones, or enterprises. The paper would be stronger if it included either larger federations or at least a stress test with many more clients and skew patterns. 2. Back
