TL;DR
RoundPipe is a novel pipeline scheduling method that enables efficient, near-zero-bubble training of large language models on consumer GPUs by dynamically dispatching computation stages across devices.
Contribution
The paper introduces RoundPipe, a pipeline schedule that breaks the weight binding constraint by treating GPUs as a pool of stateless execution workers, so any GPU can run any stage and training of large models speeds up.
Findings
Achieves 1.48–2.16× speedup over state-of-the-art baselines.
Enables fine-tuning of 1.7B to 32B models on consumer GPUs.
Supports LoRA fine-tuning of Qwen3-235B with 31K sequence length on a single server.
Abstract
Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism (PP) combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing PP schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., a large LM head) to GPUs limits the pipeline's throughput to that of the GPU with the heaviest load, leading to severe pipeline bubbles. In this paper, we propose RoundPipe, a novel pipeline schedule that breaks the weight binding constraint on consumer GPU servers. RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round-robin manner, achieving a near-zero-bubble pipeline. To ensure training…
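The weight binding issue described in the abstract can be illustrated with a toy load-accounting sketch. This is not the paper's scheduler: the stage costs, the GPU count, and the `(m + s) % num_gpus` rotation rule are all hypothetical, and the model ignores stage dependencies, the backward pass, and the cost of moving weights over PCIe. It only shows why pinning an oversized stage to one GPU leaves the others idle, while rotating stages across a pool of workers evens out the load.

```python
def per_gpu_load(stage_costs, num_gpus, num_microbatches, assign):
    """Sum the compute cost each GPU receives under an assignment rule.

    assign(m, s) maps microbatch m and stage s to a worker index
    (taken modulo num_gpus). Dependencies and transfers are ignored.
    """
    load = [0.0] * num_gpus
    for m in range(num_microbatches):
        for s, cost in enumerate(stage_costs):
            load[assign(m, s) % num_gpus] += cost
    return load


def idle_fraction(load):
    """Share of total GPU time left idle if every GPU waits for the busiest one."""
    return 1.0 - sum(load) / (len(load) * max(load))


# Hypothetical costs: the last stage stands in for a heavy LM head.
stage_costs = [1.0, 1.0, 1.0, 3.0]
num_gpus = 4
num_microbatches = 8

# Weight binding: stage s always runs on GPU s.
bound = per_gpu_load(stage_costs, num_gpus, num_microbatches, lambda m, s: s)

# Toy round-robin rotation (an assumption, not the paper's rule):
# the stage-to-GPU mapping shifts by one for each microbatch.
rotated = per_gpu_load(stage_costs, num_gpus, num_microbatches, lambda m, s: m + s)

print("bound load per GPU:  ", bound, "idle:", idle_fraction(bound))
print("rotated load per GPU:", rotated, "idle:", idle_fraction(rotated))
```

In this sketch the bound assignment leaves the GPU holding the heavy stage with three times the work of the others, so half of the aggregate GPU time is idle, whereas the rotated assignment gives every GPU the same total. A real schedule would also have to respect stage ordering and hide weight transfers, which this accounting does not attempt.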
