QuRating: Selecting High-Quality Data for Training Language Models
Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen

TL;DR
This paper introduces QuRating, a data selection method that uses learned quality ratings based on human-like judgments to improve language model training, balancing quality and diversity for better performance.
Contribution
We propose QuRating, a novel approach that learns scalar quality ratings from pairwise judgments to select high-quality training data for language models.
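As a rough illustration of how scalar ratings can be recovered from pairwise judgments, here is a minimal sketch that fits a Bradley-Terry-style model with plain gradient ascent. It is an assumption for illustration, not the authors' QuRater training code: the function name `fit_bradley_terry`, the toy judgments, and the hyperparameters are all hypothetical.

```python
import numpy as np

def fit_bradley_terry(num_texts, judgments, lr=0.5, steps=2000):
    """Fit one scalar rating per text from pairwise preferences.

    judgments: list of (i, j, p), where p is the judged probability
    that text i is preferred over text j (hypothetical toy format).
    """
    ratings = np.zeros(num_texts)
    i_idx = np.array([j[0] for j in judgments])
    j_idx = np.array([j[1] for j in judgments])
    target = np.array([j[2] for j in judgments], dtype=float)

    for _ in range(steps):
        # Model: P(i preferred over j) = sigmoid(rating_i - rating_j)
        diff = ratings[i_idx] - ratings[j_idx]
        pred = 1.0 / (1.0 + np.exp(-diff))
        err = target - pred  # gradient of the log-likelihood w.r.t. diff

        grad = np.zeros(num_texts)
        np.add.at(grad, i_idx, err)    # accumulate over repeated indices
        np.add.at(grad, j_idx, -err)
        ratings += lr * grad / len(judgments)
        ratings -= ratings.mean()      # ratings are only defined up to a shift

    return ratings

# Toy example: text 0 is usually preferred over 1, and 1 over 2.
judgments = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.95)]
ratings = fit_bradley_terry(3, judgments)
print(ratings)
```

Only differences between ratings matter in this model, which is why the sketch re-centers them to zero mean at every step.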
Findings
Models trained on high-quality data show lower perplexity.
Quality-based sampling enhances in-context learning.
Educational value ratings improve training efficiency.
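The trade-off between quality and diversity can be made concrete with a small sketch. The summary does not spell out the sampling scheme, so the code below is a hypothetical illustration: it contrasts deterministic top-k selection with sampling without replacement from a temperature-scaled softmax over the ratings. The function names, the `temperature` parameter, and the toy ratings are assumptions.

```python
import numpy as np

def select_top_k(ratings, k):
    """Deterministically keep the k highest-rated documents."""
    return np.argsort(-np.asarray(ratings, dtype=float))[:k]

def sample_by_quality(ratings, k, temperature=1.0, seed=0):
    """Sample k distinct documents with probability proportional to
    exp(rating / temperature). Higher temperature means more diversity."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(ratings, dtype=float) / temperature
    logits -= logits.max()             # stabilize the exponentials
    probs = np.exp(logits)
    probs /= probs.sum()
    # Note: a very small temperature can underflow some probabilities to
    # zero, in which case sampling k distinct items may fail.
    return rng.choice(len(probs), size=k, replace=False, p=probs)

ratings = np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # toy quality ratings
top = select_top_k(ratings, 2)
sampled = sample_by_quality(ratings, 2, temperature=2.0)
print(top, sampled)
```

Under these assumptions, top-k selection always returns the same highest-rated documents, while the sampled variant still favors them but leaves lower-rated documents a chance of being included.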
Abstract
Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value - and find that LLMs are able to discern these qualities, especially when making pairwise judgments of texts. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity. When we sample using quality…
Code & Models
- princeton-nlp/QuRating-GPT3.5-Judgments (dataset, 16 downloads)
- princeton-nlp/QuRating-GPT3.5-Judgments-Test (dataset, 23 downloads)
- princeton-nlp/QuRatedPajama-1B_tokens_for_analysis (dataset, 70 downloads)
- princeton-nlp/QuRatedPajama-260B (dataset, 55 downloads)
- OpenDataArena/OpenDataArena-scored-data-2603 (dataset, 217 downloads)
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management
