GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping
Yishu Yin, Xuehai Qian

TL;DR
GreedySnake is an SSD-offloaded training system for large language models that combines vertical scheduling with optimizer step overlapping to raise training throughput on A100 GPUs.
Contribution
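Introduces vertical scheduling, which executes all micro-batches of a layer before proceeding to the next layer, and overlaps part of the optimizer step with the forward pass of the next iteration to make SSD-offloaded LLM training more efficient.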
Findings
Achieves nearly double the training throughput of existing systems on GPT-65B and GPT-175B models.
Mitigates the I/O bottleneck by overlapping part of the optimizer step with the forward pass of the next iteration.
Demonstrates these throughput gains over ZeRO-Infinity on A100 GPUs.
Abstract
SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all micro-batches of a layer before proceeding to the next layer. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially, each through all layers), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake improves saturated training throughput over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4…
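The difference between the two scheduling orders described in the abstract can be illustrated with a toy model. The sketch below is not GreedySnake's implementation: the function names are hypothetical, it covers only a forward pass, and it assumes (for illustration only) that a single layer's parameters are resident on the GPU at a time, so every change of layer costs one parameter load from offloaded storage.

```python
from typing import List, Tuple

Step = Tuple[int, int]  # (micro_batch_index, layer_index)


def horizontal_order(num_layers: int, num_micro_batches: int) -> List[Step]:
    """Run each micro-batch through all layers before starting the next one."""
    return [(mb, layer)
            for mb in range(num_micro_batches)
            for layer in range(num_layers)]


def vertical_order(num_layers: int, num_micro_batches: int) -> List[Step]:
    """Run all micro-batches through one layer before moving to the next layer."""
    return [(mb, layer)
            for layer in range(num_layers)
            for mb in range(num_micro_batches)]


def count_layer_loads(order: List[Step]) -> int:
    """Count parameter loads under the toy assumption that only one
    layer's parameters are resident at any time."""
    loads = 0
    resident = None
    for _, layer in order:
        if layer != resident:
            loads += 1
            resident = layer
    return loads


if __name__ == "__main__":
    layers, micro_batches = 4, 3
    print("horizontal loads:", count_layer_loads(horizontal_order(layers, micro_batches)))
    print("vertical loads:  ", count_layer_loads(vertical_order(layers, micro_batches)))
```

In this simplified model the horizontal order reloads every layer once per micro-batch, while the vertical order loads each layer once per pass, which is one way to see why the vertical order can reduce parameter I/O per iteration.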
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cloud Computing and Resource Management
