Mitigating Non-IID Drift in Zeroth-Order Federated LLM Fine-Tuning with Transferable Sparsity
Yide Ran, Wentao Guo, Jingwei Sun, Yanzhou Pan, Xiaodong Yu, Hao Wang, Jianwen Xie, Yiran Chen, Denghui Zhang, Zhaozhuo Xu

TL;DR
This paper introduces Meerkat, a sparse zeroth-order optimization method for federated LLM fine-tuning that reduces communication costs and mitigates Non-IID data challenges through high-frequency synchronization and client analysis.
Contribution
It proposes a novel sparse ZO approach with a virtual path mechanism to handle Non-IID drift and improve federated LLM fine-tuning efficiency and performance.
Findings
Meerkat achieves high communication efficiency and outperforms full-parameter ZO.
Meerkat-vp effectively identifies extreme Non-IID clients using GradIP trajectories.
High-frequency synchronization mitigates Non-IID data challenges.
Abstract
Federated Learning enables collaborative fine-tuning of Large Language Models (LLMs) across decentralized Non-Independent and Identically Distributed (Non-IID) clients, but such models' massive parameter sizes lead to significant memory and communication challenges. This work introduces Meerkat, a sparse zeroth-order optimization (ZO) method designed for federated LLM fine-tuning. By limiting fine-tuning to a transferable, static, extremely sparse subset of parameters, Meerkat achieves remarkable communication efficiency, enabling cost-effective high-frequency synchronization. With theoretical analysis and experiments, we show that this high-frequency communication effectively mitigates Non-IID data challenges and leads to superior performance compared to full-parameter ZO. Furthermore, experimental results show that Meerkat outperforms existing sparsity baselines with better performance…
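To make the idea of zeroth-order fine-tuning on a static sparse subset concrete, here is a minimal sketch in Python. It assumes a MeZO-style two-point estimator and a seeded Gaussian perturbation; the abstract does not spell out the exact estimator, and the names `sparse_zo_step`, `mask_idx`, `eps`, and `lr` are hypothetical, chosen only for illustration.

```python
import numpy as np

def sparse_zo_step(params, mask_idx, loss_fn, seed, eps=1e-3, lr=1e-2):
    """One two-point zeroth-order step restricted to the coordinates in mask_idx.

    Assumption: a MeZO-style estimator; this is not confirmed to be the
    paper's exact update rule.
    """
    # The perturbation is fully determined by the seed, so any party that
    # knows the seed can regenerate it without receiving the vector itself.
    z = np.random.default_rng(seed).standard_normal(mask_idx.size)

    plus = params.copy()
    plus[mask_idx] += eps * z
    minus = params.copy()
    minus[mask_idx] -= eps * z

    # Scalar projected-gradient estimate from two forward passes (no backprop).
    g = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)

    # Only the masked coordinates move; everything else stays frozen.
    new_params = params.copy()
    new_params[mask_idx] -= lr * g * z
    return new_params, g
```

Under these assumptions each local step is summarized by a seed and one scalar `g`, which is one way to see why synchronizing often can stay cheap when the trainable subset is small and fixed.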
Peer Reviews
Decision: ICLR 2026 Poster
1. Virtual path and GradIP are novel ideas. By reconstructing each client’s local update trajectory, the server gains visibility into how that client is moving in parameter space without seeing private data. GradIP then becomes a quantitative signal that reveals which clients are extreme Non-IID. 2. The paper provides convergence analyses for both MEERKAT and MEERKAT-VP. 3. Strong empirical validation. The experiments cover multiple open LLMs (Llama 3.2 1B, Qwen2 1.5B, Gemma2 2B), multiple tas
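The trajectory reconstruction described in this review can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes each client reports one seed and one scalar projected gradient per local step, and it treats GradIP as the inner product between each reconstructed ZO gradient and a server-held reference direction. The excerpt does not define GradIP, so that definition and the names `replay_virtual_path`, `gradip_trajectory`, and `reference` are hypothetical.

```python
import numpy as np

def replay_virtual_path(start, mask_idx, seeds, scalars, lr=1e-2):
    """Rebuild the masked coordinates after every local step.

    Assumption: the server knows the per-step seeds and receives the
    scalar projected gradients, so no raw client data is needed.
    """
    current = start[mask_idx].astype(float)
    path = [current.copy()]
    for seed, g in zip(seeds, scalars):
        # Regenerate the same perturbation the client drew from this seed.
        z = np.random.default_rng(seed).standard_normal(mask_idx.size)
        current = current - lr * g * z
        path.append(current.copy())
    return np.stack(path)  # shape: (num_steps + 1, mask size)

def gradip_trajectory(mask_size, seeds, scalars, reference):
    """Hypothetical GradIP signal: (g * z) . reference for each local step."""
    values = []
    for seed, g in zip(seeds, scalars):
        z = np.random.default_rng(seed).standard_normal(mask_size)
        values.append(float((g * z) @ reference))
    return np.array(values)
```

A server could then inspect how the per-client sequence returned by `gradip_trajectory` evolves over local steps. Which patterns mark a client as extreme Non-IID depends on thresholds this sketch does not attempt to model.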
1. Hyperparameters in MEERKAT-VP: The early stopping rule uses thresholds on GradIP phase ratios and quiescent duration. Although a small sensitivity study is reported, it is still unclear how practitioners should pick these thresholds for new tasks with no oracle labels. 2. Baselines in experiments: There is active work on federated LoRA under heterogeneous clients, which explicitly tackles aggregation noise, knowledge contamination, and aggregation distortion by separating global and clie
i. MEERKAT employs an extremely sparse and static subset of parameters for fine-tuning. This drastically reduces the communication load and memory consumption on client devices, addressing the primary scalability bottleneck of LLMs in FL. ii. The extreme sparsity enables cost-effective high-frequency client-server synchronization. This high synchronization rate is key to effectively suppressing the client drift caused by Non-IID data distributions across decentralized clients. iii. The paper prop
i. In the description of the steps in Figure 1 of the paper, the word "Aggregrate" appears, and the correct spelling should be "Aggregate". In the phrase on lines 153–154, "(3) Sever aggregates and initiate the next round...", shouldn't the word "Sever" actually be spelled "Server"? Line 165 contains a redundant repetition of the article "a" in the phrase "and a a new seed list." ii. Although the paper validates the effectiveness of the extreme sparsity level of 0.1%, it fails to deeply analyze the
* The paper is clearly presented, systematically proposing three distinct claims and validating them one by one. * It includes a detailed convergence analysis for the proposed methods. * The experimental evaluation is comprehensive, covering various benchmarks and recent LLMs.
* The use of sparse updates for communication efficiency in FL is a well-explored area, often framed as model compression. The paper should provide a more thorough comparison with this body of work to clarify the unique contributions of applying sparsity specifically to ZO-based FL fine-tuning. Applying sparsity to a new FL setting (here, federated fine-tuning) does not seem to be a strong contribution. * Methods for tackling the Non-IID issue in compressed FL have also been developed, such as the one in [
Taxonomy
Topics: Caching and Content Delivery · Privacy-Preserving Technologies in Data · Advanced Data and IoT Technologies
Methods: Early Stopping
