DataSculpt: Crafting Data Landscapes for Long-Context LLMs through Multi-Objective Partitioning
Keer Lu, Xiaonan Nie, Zheng Liang, Da Pan, Shusen Zhang, Keshi Zhao, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang

TL;DR
DataSculpt is a data management framework for long-context training of large language models. It organizes training data through multi-objective partitioning and significantly improves performance across a range of NLP tasks.
Contribution
It introduces a multi-objective combinatorial optimization approach for organizing training data, enhancing long-context capabilities without sacrificing overall performance.
Findings
Improved performance on long-context tasks by up to 21%
Enhanced overall model proficiency by 4.88%
Effective data organization via multi-objective clustering
Abstract
In recent years, Large Language Models (LLMs) have demonstrated significant improvements across a variety of tasks, one of which is the long-context capability. The key to improving long-context performance lies in effective data organization and management strategies that integrate data from multiple domains and optimize the context window during training. Through extensive experimental analysis, we identified three key challenges in designing effective data management strategies that enable the model to achieve long-context capability without sacrificing performance in other tasks: (1) a shortage of long documents across multiple domains, (2) effective construction of context windows, and (3) efficient organization of large-scale datasets. To address these challenges, we introduce DataSculpt, a novel data management framework designed for long-context training. We first formulate the…
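To make the idea of constructing context windows under several objectives more concrete, here is a minimal sketch, not the authors' algorithm. It assumes two illustrative objectives that the abstract does not spell out: semantic relevance between a document and the documents already in a window, and low wasted space in the window. It uses a simple greedy heuristic as a stand-in for the combinatorial optimization. The names `Doc`, `Window`, `partition`, `alpha`, and `beta` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    tokens: int             # document length in tokens
    embedding: list[float]  # assumed precomputed semantic embedding


@dataclass
class Window:
    capacity: int
    docs: list[Doc] = field(default_factory=list)

    @property
    def used(self) -> int:
        return sum(d.tokens for d in self.docs)

    def centroid(self) -> list[float]:
        # Mean embedding of the documents currently in this window.
        dim = len(self.docs[0].embedding)
        n = len(self.docs)
        return [sum(d.embedding[i] for d in self.docs) / n for i in range(dim)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def partition(docs: list[Doc], capacity: int,
              alpha: float = 1.0, beta: float = 0.5) -> list[Window]:
    """Greedy heuristic (an assumption, not the paper's method).

    Longest documents are placed first. Each document goes to the window
    that fits it and maximizes alpha * relevance - beta * leftover space.
    A new window is opened only when no existing window has room.
    """
    windows: list[Window] = []
    for doc in sorted(docs, key=lambda d: d.tokens, reverse=True):
        if doc.tokens > capacity:
            raise ValueError(f"{doc.doc_id} exceeds the window capacity")
        best, best_score = None, -math.inf
        for w in windows:
            free = capacity - w.used
            if doc.tokens > free:
                continue  # hard constraint: never overflow a window
            relevance = cosine(doc.embedding, w.centroid())
            leftover = (free - doc.tokens) / capacity
            score = alpha * relevance - beta * leftover
            if score > best_score:
                best, best_score = w, score
        if best is None:
            best = Window(capacity)
            windows.append(best)
        best.docs.append(doc)
    return windows
```

The weights `alpha` and `beta` trade relevance against packing efficiency. In this sketch a document always joins an existing window when one has room, even if relevance is low, which a real formulation would likely constrain further.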
Taxonomy
Topics: Scheduling and Optimization Algorithms · Manufacturing Process and Optimization · Simulation Techniques and Applications
