Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

Hadi Pouransari; Chun-Liang Li; Jen-Hao Rick Chang; Pavan Kumar Anasosalu Vasu; Cem Koc; Vaishaal Shankar; Oncel Tuzel

arXiv:2405.13226·cs.CL·January 8, 2025·1 cites

TL;DR

This paper introduces dataset decomposition, a variable sequence length training method for large language models that reduces computational costs and improves performance by sampling from buckets of different sequence lengths with a curriculum.

Contribution

We propose a novel dataset decomposition technique that enables efficient training of LLMs with variable sequence lengths, outperforming the traditional concat-and-chunk approach.
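For reference, the concat-and-chunk baseline can be sketched as follows. This is a minimal illustration, not the authors' code: documents are shuffled, their tokens concatenated into one stream, and the stream is cut into fixed-length chunks. The function names, the toy documents, and the choice to drop the trailing remainder are all assumptions made for the example.

```python
import random


def concat_and_chunk(documents, target_len, seed=0):
    """Shuffle documents, concatenate their tokens, and cut fixed-length chunks.

    Each chunk is a list of (doc_id, token) pairs so that document
    boundaries inside a chunk remain visible.
    """
    order = list(range(len(documents)))
    random.Random(seed).shuffle(order)
    stream = [(doc_id, tok) for doc_id in order for tok in documents[doc_id]]
    # Assumption: the trailing remainder that cannot fill a chunk is dropped.
    n_full = len(stream) // target_len
    return [stream[i * target_len:(i + 1) * target_len] for i in range(n_full)]


def docs_per_chunk(chunks):
    """Count how many distinct documents contribute tokens to each chunk."""
    return [len({doc_id for doc_id, _ in chunk}) for chunk in chunks]


# Toy corpus: five "documents" of lengths 5, 3, 9, 2, and 13 tokens.
docs = [list(range(n)) for n in (5, 3, 9, 2, 13)]
chunks = concat_and_chunk(docs, target_len=8)
print(docs_per_chunk(chunks))
```

Any chunk that mixes several documents is where cross-document attention masking shortens the effective context, and any document longer than the target length is necessarily split across chunks.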

Findings

01. Training with dataset decomposition reaches target accuracy up to 6x faster than the concat-and-chunk baseline.

02. Our method achieves comparable or better accuracy on standard language and long-context benchmarks.

03. Efficient long-sequence training scales well with dataset size.

Abstract

Large language models (LLMs) are commonly trained on datasets consisting of fixed-length token sequences. These datasets are created by randomly concatenating documents of various lengths and then chunking them into sequences of a predetermined target length (concat-and-chunk). Recent attention implementations mask cross-document attention, reducing the effective length of a chunk of tokens. Additionally, training on long sequences becomes computationally prohibitive due to the quadratic cost of attention. In this study, we introduce dataset decomposition, a novel variable sequence length training technique, to tackle these challenges. We decompose a dataset into a union of buckets, each containing sequences of the same size extracted from a unique document. During training, we use variable sequence length and batch-size, sampling simultaneously from all buckets with a curriculum. In…
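The bucket construction described in the abstract might look roughly like the sketch below. The excerpt does not say how chunk lengths are chosen, so the power-of-two split (following the binary representation of the document length, capped at `2**max_exp`) is an assumption, as are the function names and the fixed `tokens_per_batch` budget used to derive a batch size per sequence length. The curriculum over buckets is not modeled here.

```python
from collections import defaultdict


def decompose_document(tokens, max_exp):
    """Split one document into chunks whose lengths are powers of two.

    Assumption: lengths follow the binary representation of len(tokens),
    capped at 2**max_exp. Every chunk comes from this single document.
    """
    chunks = []
    pos = 0
    remaining = len(tokens)
    while remaining > 0:
        # Largest power of two that fits, limited by the cap.
        exp = min(remaining.bit_length() - 1, max_exp)
        size = 1 << exp
        chunks.append(tokens[pos:pos + size])
        pos += size
        remaining -= size
    return chunks


def build_buckets(documents, max_exp):
    """Group chunks by length: bucket[L] holds only sequences of length L."""
    buckets = defaultdict(list)
    for doc in documents:
        for chunk in decompose_document(doc, max_exp):
            buckets[len(chunk)].append(chunk)
    return dict(buckets)


def batch_size_for(seq_len, tokens_per_batch):
    """Hypothetical rule: keep tokens per batch roughly constant."""
    return max(1, tokens_per_batch // seq_len)


docs = [list(range(n)) for n in (5, 3, 9, 2, 13)]
buckets = build_buckets(docs, max_exp=3)
for seq_len in sorted(buckets):
    print(seq_len, len(buckets[seq_len]), batch_size_for(seq_len, 64))
```

Under these assumptions no sequence crosses a document boundary, and shorter sequences are paired with larger batches so that each optimization step processes a similar number of tokens.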

Code & Models

Repositories

apple/ml-dataset-decomposition (official)

Models

apple/DCLM-7B-8k (Hugging Face model · 9 downloads · 45 likes)

Taxonomy

Topics: Machine Learning and Data Classification · Natural Language Processing Techniques · Topic Modeling