DataMan: Data Manager for Pre-training Large Language Models

Ru Peng; Kexin Yang; Yawen Zeng; Junyang Lin; Dayiheng Liu; Junbo Zhao

arXiv:2502.19363·cs.CL·April 9, 2025


Open Access · 3 Reviews

TL;DR

This paper introduces DataMan, a data management system that uses quality criteria and domain recognition to improve large language model pre-training, resulting in better performance with less data and enhanced domain-specific capabilities.

Contribution

We propose a novel Data Manager (DataMan) that learns quality ratings and domain recognition to optimize pre-training data selection for large language models.

Findings

01

DataMan improves in-context learning and perplexity over baselines.

02

High-rated, domain-specific data enhances domain-specific ICL performance.

03

Quality criteria are complementary and weakly correlated with perplexity.

Abstract

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by "reverse thinking": prompting LLMs to self-identify which criteria benefit their performance. As an LLM's pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and a domain type. Our experiments validate our approach, using DataMan to select 30B…
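The selection step described in the abstract (annotate each document with quality ratings and a domain, then pick a subset under a token budget) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the record fields `overall`, `domain`, and `n_tokens`, the integer rating scale, the `min_overall` threshold, and the greedy per-domain budget split are all assumptions made for the example.

```python
from collections import defaultdict


def select_top_rated(docs, token_budget, min_overall=4, domain_weights=None):
    """Greedily pick the highest-rated documents per domain under a token budget.

    All field names and the rating threshold are hypothetical; they stand in
    for whatever annotations a data manager model would attach to each document.
    """
    # Group documents that pass the (assumed) quality threshold by domain.
    by_domain = defaultdict(list)
    for doc in docs:
        if doc["overall"] >= min_overall:
            by_domain[doc["domain"]].append(doc)

    # Default to a uniform mix over the domains that have eligible documents.
    if domain_weights is None:
        domain_weights = {d: 1 / len(by_domain) for d in by_domain}

    selected = []
    for domain, weight in domain_weights.items():
        budget = weight * token_budget
        used = 0
        # Visit the best-rated documents first; skip any that would overflow.
        ranked = sorted(by_domain.get(domain, []),
                        key=lambda d: d["overall"], reverse=True)
        for doc in ranked:
            if used + doc["n_tokens"] > budget:
                continue
            selected.append(doc)
            used += doc["n_tokens"]
    return selected


# Toy annotated corpus (made-up values for illustration).
docs = [
    {"id": "a", "domain": "medicine", "overall": 5, "n_tokens": 100},
    {"id": "b", "domain": "medicine", "overall": 3, "n_tokens": 100},
    {"id": "c", "domain": "law", "overall": 4, "n_tokens": 150},
    {"id": "d", "domain": "law", "overall": 5, "n_tokens": 150},
]
picked = select_top_rated(docs, token_budget=400)
```

Passing an explicit `domain_weights` dictionary instead of the uniform default would correspond to a domain-mixing choice; the weights themselves are a design decision this sketch does not prescribe.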

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The motivation of using LLMs to self-select the criteria that are beneficial for the model's performance is straightforward and easy to understand. The overall framework makes sense.
2. The description of the method is clear and easy to follow.
3. Data selection for pre-training, instruction-tuning, and SFT is an important direction for LLM applications, and I think the direction in this paper is interesting and could be beneficial for downstream applications. The experiments show that th

Weaknesses

1. I think the motivation for using 14 scores is still a little unclear. It would be beneficial to have an analysis of why these scores were chosen and how they can distinctly differentiate models' abilities during evaluation.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper uses a novel "reverse thinking" concept that derives quality criteria from documents filtered into high- and low-perplexity buckets, which avoids the pitfalls of criteria based on human intuition.
- Introduces 13 detailed criteria for quality, which moves beyond typical heuristic approaches that threshold on perplexity, deduplication, etc.
- The focus on 13 specific quality metrics and their annotations makes it clear why a particular document was chosen by the model for inclusion in t

Weaknesses

- The authors apply DataMan to a 30B subset of SlimPajama, which is a relatively small dataset. Further work should be done to validate that the method works for large pretraining-scale datasets.
- The SlimPajama dataset has already been filtered for quality with duplicate documents removed. It’s not clear if this influences the analysis of the method or if similar improvements would be seen applying DataMan to unfiltered web data.
- The paper does not discuss the inference FLOPs required to run

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- Uses LLMs to filter pre-training data, as opposed to using heuristics.
- Promises to release code, models, and datasets, which would benefit the research community, especially given the cost of obtaining such models and datasets.

Weaknesses

- **Novelty**: [1] already proposed training a model to rate and select high-quality pre-training data based on four qualities: "writing style", "required expertise", "facts & trivia", and "educational value". The main differences seem to be the criteria considered, the model used (Qwen2-1.5B vs Sheared-Llama-1.3B), and the use of pointwise instead of pairwise ratings.

[1] Wettig et al., QuRating: Selecting High-Quality Data for Training Language Models, 2024.


Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques · Topic Modeling