Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel

TL;DR
This paper introduces dataset decomposition, a variable sequence length training method for large language models that reduces computational costs and improves performance by sampling from buckets of different sequence lengths with a curriculum.
Contribution
We propose a novel dataset decomposition technique that enables efficient training of LLMs with variable sequence lengths, outperforming the traditional concat-and-chunk approach.
Findings
Training with dataset decomposition reaches target accuracy up to 6x faster than the concat-and-chunk baseline.
Our method achieves comparable or better accuracy on language and long-context benchmarks.
Efficient long-sequence training scales well with dataset size.
Abstract
Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length (concat-and-chunk). Recent attention implementations mask cross-document attention, reducing the effective length of a chunk of tokens. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence length and batch-size, sampling simultaneously from all buckets with a curriculum. In…
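The bucketing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that bucket sizes are powers of two and that the batch size for each bucket is chosen to keep the number of tokens per batch constant; the abstract itself only states that each bucket holds same-size sequences taken from a single document and that sequence length and batch size vary. The names `decompose`, `build_buckets`, `batch_size_for`, `max_exp`, and `tokens_per_batch` are hypothetical, and the curriculum over buckets is not modeled.

```python
from collections import defaultdict


def decompose(doc_tokens, max_exp=13):
    """Split one document into chunks whose lengths are powers of two.

    Assumption: power-of-two chunk sizes, capped at 2**max_exp.
    Returns a list of (exponent, chunk) pairs, each chunk of length 2**exponent.
    No chunk ever mixes tokens from two documents.
    """
    chunks = []
    start = 0
    remaining = len(doc_tokens)
    while remaining > 0:
        # Largest power of two that fits in the remaining tokens, capped.
        exp = min(remaining.bit_length() - 1, max_exp)
        size = 1 << exp
        chunks.append((exp, doc_tokens[start:start + size]))
        start += size
        remaining -= size
    return chunks


def build_buckets(docs, max_exp=13):
    """Group chunks from all documents by length: buckets[i] holds 2**i-token chunks."""
    buckets = defaultdict(list)
    for doc in docs:
        for exp, chunk in decompose(doc, max_exp):
            buckets[exp].append(chunk)
    return buckets


def batch_size_for(exp, tokens_per_batch=2**16):
    """Assumption: keep tokens per batch fixed, so shorter sequences get larger batches."""
    return max(1, tokens_per_batch // (1 << exp))


if __name__ == "__main__":
    docs = [list(range(13)), list(range(4))]
    buckets = build_buckets(docs)
    for exp in sorted(buckets):
        print(2 ** exp, len(buckets[exp]), batch_size_for(exp, tokens_per_batch=64))
```

Under these assumptions, a 13-token document splits into chunks of 8, 4, and 1 tokens, so every training sequence comes from exactly one document and no cross-document attention masking is needed within a sequence.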
Taxonomy
Topics: Machine Learning and Data Classification · Natural Language Processing Techniques · Topic Modeling
