SlimPajama-DC: Understanding Data Combinations for LLM Training
Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing

TL;DR
This paper analyzes how different data combination strategies, including deduplication methods and source proportions, impact the training performance of large language models using the SlimPajama dataset.
Contribution
It provides an empirical analysis of data deduplication effects and optimal data mixing strategies for large language model training with SlimPajama.
Findings
Global deduplication improves model performance.
High data diversity enhances results after deduplication.
Best data configuration outperforms RedPajama baseline.
Abstract
This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the pretraining of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different dataset sources) and local (within a single dataset source) deduplication affect the performance of trained models. (2)…
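The difference between global and local deduplication described in the abstract can be illustrated with a small sketch. This is not the SlimPajama pipeline: it assumes exact matching on case- and whitespace-normalized text as a simplified stand-in for whatever duplicate detection the dataset actually uses, and the function names (`local_dedup`, `global_dedup`) and the toy corpus are hypothetical.

```python
import hashlib


def _key(text):
    """Hash of case- and whitespace-normalized text (exact-match stand-in, an assumption)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def local_dedup(sources):
    """Remove duplicates within each source independently."""
    result = {}
    for name, docs in sources.items():
        seen = set()  # fresh set per source: cross-source copies survive
        kept = []
        for doc in docs:
            k = _key(doc)
            if k not in seen:
                seen.add(k)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources):
    """Remove duplicates across all sources using one shared set."""
    seen = set()  # shared across sources: cross-source copies are dropped
    result = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            k = _key(doc)
            if k not in seen:
                seen.add(k)
                kept.append(doc)
        result[name] = kept
    return result


# Hypothetical toy corpus: "Hello world" appears in both sources.
toy = {
    "web": ["The cat sat.", "the  cat sat.", "Hello world"],
    "wiki": ["Hello world", "Paris is in France."],
}

local_result = local_dedup(toy)
global_result = global_dedup(toy)
```

In this sketch, local deduplication keeps "Hello world" in both `web` and `wiki`, while global deduplication keeps only the first copy encountered, so the result depends on the order in which sources are processed.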
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management
Methods: SwiGLU · Attention with Linear Biases
