SlimPajama-DC: Understanding Data Combinations for LLM Training

Zhiqiang Shen; Tianhua Tao; Liqun Ma; Willie Neiswanger; Zhengzhong Liu; Hongyi Wang; Bowen Tan; Joel Hestness; Natalia Vassilieva; Daria Soboleva; Eric Xing

arXiv:2309.10818·cs.CL·May 10, 2024·2 cites

TL;DR

This paper analyzes how different data combination strategies, including deduplication methods and source proportions, impact the training performance of large language models using the SlimPajama dataset.

Contribution

It provides an empirical analysis of data deduplication effects and optimal data mixing strategies for large language model training with SlimPajama.

Findings

01. Global deduplication improves model performance.

02. High data diversity enhances results after deduplication.

03. The best data configuration outperforms the RedPajama baseline.

Abstract

This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the pretraining of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T token RedPajama dataset contributed by Together. We have termed our research as SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different sources of datasets) and local (within the single source of dataset) deduplications affect the performance of trained models. (2)…
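The distinction between local and global deduplication in observation (1) can be illustrated with a minimal sketch. It assumes exact-match hashing on whitespace- and case-normalized text as a simplified stand-in for whatever near-duplicate detection the actual SlimPajama pipeline uses; the function names and the toy corpus are hypothetical and do not come from the paper.

```python
import hashlib


def _fingerprint(text):
    # Illustrative normalization (an assumption): lowercase and collapse whitespace.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def local_dedup(sources):
    """Remove duplicates within each source independently."""
    result = {}
    for name, docs in sources.items():
        seen = set()  # a fresh set per source: cross-source duplicates survive
        kept = []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources):
    """Remove duplicates across all sources, keeping the first occurrence."""
    seen = set()  # one set shared by every source
    result = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


# Hypothetical toy corpus: "Hello world" appears in both sources.
toy_sources = {
    "web": ["The cat sat.", "the  cat sat.", "Hello world"],
    "wiki": ["Hello world", "Paris is in France."],
}

local = local_dedup(toy_sources)
global_ = global_dedup(toy_sources)

print(sum(len(docs) for docs in local.values()))    # documents kept after local dedup
print(sum(len(docs) for docs in global_.values()))  # documents kept after global dedup
```

In this sketch, global deduplication depends on the order in which sources are visited, because the first occurrence of a document is the one retained. That ordering is a design choice of the example, not a claim about how SlimPajama resolves cross-source duplicates.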

Code & Models

Repositories

cerebras/modelzoo
PyTorch · Official

Models

MBZUAI-LLM/SlimPajama-DC
model · 10 dl · ♡ 2

Datasets

MBZUAI-LLM/SlimPajama-627B-DC
dataset · 6.7k dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management

Methods: SwiGLU · Attention with Linear Biases