DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing
Liangyu Wang, Huanyi Xie, Di Wang

TL;DR
DistZO2 is a distributed framework that significantly improves the throughput of zeroth-order fine-tuning of large language models by combining parallel strategies and hardware-aware communication, enabling scalable and memory-efficient training.
Contribution
DistZO2 introduces a 2D parallelism approach and hardware-aware communication strategies that scale zeroth-order fine-tuning of LLMs across multiple GPUs.
Findings
Achieves a 3x speedup over ZO2 on OPT-175B.
Maintains the memory efficiency of zeroth-order methods.
Scales zeroth-order fine-tuning to multi-GPU systems.
Abstract
Fine-tuning large language models (LLMs) remains resource-intensive due to their sheer scale. While zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating backward passes, its application to multi-hundred-billion-parameter models is constrained by GPU memory and compute throughput. The ZO2 framework addresses the memory bottleneck by offloading model parameters to CPU memory and overlapping transformer block transfer with dual forward computation on a single GPU. However, ZO2 remains limited by its single-device execution and achieves modest throughput. In this work, we present DistZO2, a high-throughput, memory-efficient framework for distributed zeroth-order fine-tuning of LLMs. DistZO2 introduces three parallel strategies: (1) Perturbation Parallelism (PertP), which parallelizes the two perturbed forward passes across devices; (2) Distributed Data…
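To make the "two perturbed forward passes" concrete, here is a minimal sketch of a generic two-point zeroth-order update in NumPy. It is not the DistZO2 or ZO2 implementation: the linear model, the function names (`loss`, `zo_step`, `train`), and the hyperparameters are all assumptions chosen for illustration. The point it shows is that the two loss evaluations share only a random seed and are otherwise independent, which is the property the abstract says Perturbation Parallelism exploits by placing them on different devices.

```python
import numpy as np


def loss(theta, X, y):
    """Mean squared error of a toy linear model (stand-in for an LLM forward pass)."""
    return float(np.mean((X @ theta - y) ** 2))


def zo_step(theta, X, y, lr=0.02, eps=1e-3, seed=0):
    """One two-point zeroth-order update; no backward pass is used."""
    # The perturbation is regenerated from the seed, so it never has to be stored.
    z = np.random.default_rng(seed).standard_normal(theta.shape)
    # The two forward passes below do not depend on each other.
    loss_plus = loss(theta + eps * z, X, y)
    loss_minus = loss(theta - eps * z, X, y)
    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return theta - lr * projected_grad * z, projected_grad


def train(steps=2000):
    """Fit the toy model and return (initial_loss, final_loss)."""
    rng = np.random.default_rng(42)
    X = rng.standard_normal((64, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0])
    theta = np.zeros(4)
    initial = loss(theta, X, y)
    for step in range(steps):
        theta, _ = zo_step(theta, X, y, seed=step)
    return initial, loss(theta, X, y)


if __name__ == "__main__":
    initial, final = train()
    print(f"loss before: {initial:.4f}, after: {final:.6f}")
```

Because each step needs only two scalars (`loss_plus`, `loss_minus`) and a seed, workers that evaluate the two passes separately would, in a scheme like this, exchange very little data per step.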
