TL;DR
Skrull introduces a dynamic data scheduling method that significantly improves training efficiency for long-context fine-tuning of large language models by balancing computation across heterogeneous sequence lengths.
Contribution
This paper presents Skrull, a novel lightweight data scheduler designed to optimize long-context fine-tuning by addressing data heterogeneity challenges in LLM training.
Findings
Skrull outperforms DeepSpeed by up to 7.54x in training efficiency.
The scheduling algorithm achieves near-zero online scheduling cost.
Experimental results validate the effectiveness of Skrull in real-world scenarios.
Abstract
Long-context supervised fine-tuning (Long-SFT) plays a vital role in enhancing the performance of large language models (LLMs) on long-context tasks. To smoothly adapt LLMs to long-context scenarios, this process typically entails training on mixed datasets containing both long and short sequences. However, this heterogeneous sequence length distribution poses significant challenges for existing training systems, as they fail to simultaneously achieve high training efficiency for both long and short sequences, resulting in sub-optimal end-to-end system performance in Long-SFT. In this paper, we present a novel perspective on data scheduling to address the challenges posed by the heterogeneous data distributions in Long-SFT. We propose Skrull, a dynamic data scheduler specifically designed for efficient Long-SFT. Through dynamic data scheduling, Skrull balances the computation…
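To make the idea of balancing computation across mixed sequence lengths concrete, here is a minimal sketch of a generic greedy longest-first assignment. It is not Skrull's algorithm, which this summary does not describe. The function name `balance_sequences` and the quadratic cost model (a rough stand-in for attention cost) are assumptions made for illustration only.

```python
import heapq


def balance_sequences(lengths, num_workers, cost=lambda n: n * n):
    """Assign sequences to workers so that estimated compute is roughly even.

    Hypothetical helper, not taken from the paper. `cost` is an assumed
    per-sequence compute estimate; the default n*n loosely mimics the
    quadratic cost of self-attention.
    """
    # Min-heap of (accumulated_cost, worker_index).
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_workers)]

    # Place the longest sequences first, each on the least-loaded worker.
    for n in sorted(lengths, reverse=True):
        load, w = heapq.heappop(heap)
        buckets[w].append(n)
        heapq.heappush(heap, (load + cost(n), w))
    return buckets


if __name__ == "__main__":
    batch = [8, 1, 1, 1, 1, 4, 4, 2]  # token counts, arbitrary units
    print(balance_sequences(batch, num_workers=2))
```

Under the assumed quadratic cost, a single long sequence can occupy one worker while all the short sequences share the other, which is the kind of imbalance a length-aware scheduler has to account for.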