Large Language Model-guided Document Selection
Xiang Kong, Tom Gunter, Ruoming Pang

TL;DR
This paper introduces a scalable document selection method using LLMs as graders to filter large corpora, enabling high-quality model training with significantly reduced computational costs.
Contribution
It uses a prompted LLM to grade documents, distills those quality labels into a classifier, and applies the classifier to drop 75% of the training corpus while maintaining performance across benchmarks.
Findings
Filtering matches the quality of a model trained on the full corpus with at most 70% of the FLOPs.
Better LLM labelers and classifiers improve results and reduce prompt sensitivity.
In-context learning enhances performance of less-capable labeling models.
Abstract
Large Language Model (LLM) pre-training exhausts an ever growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al., 2023], we explore a promising direction for scalable general-domain document selection; employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is applied at scale to a large, and already heavily-filtered, web-crawl-derived corpus autonomously. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple…
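The grade, distill, and filter steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_grader` is a hypothetical keyword rule standing in for the prompted LLM, `distill_classifier` fits a toy token-average scorer in place of a real classifier model, and `select_documents` and its `keep_fraction` parameter are names invented here. Only the 25% keep rate (dropping 75% of the corpus) comes from the text.

```python
from collections import defaultdict
from typing import Callable, List, Sequence


def toy_grader(doc: str) -> float:
    """Hypothetical stand-in for a prompted LLM grader: 1.0 = high quality."""
    return 1.0 if {"theorem", "lemma", "proof"} & set(doc.lower().split()) else 0.0


def distill_classifier(
    sample: Sequence[str], grader: Callable[[str], float]
) -> Callable[[str], float]:
    """Fit a toy scorer from grader labels (assumption: not the paper's classifier).

    Each token's score is the mean grader label of the sample docs containing it.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for doc in sample:
        label = grader(doc)
        for tok in set(doc.lower().split()):
            totals[tok] += label
            counts[tok] += 1

    def score(doc: str) -> float:
        known = [t for t in set(doc.lower().split()) if t in counts]
        if not known:
            return 0.5  # no evidence either way
        return sum(totals[t] / counts[t] for t in known) / len(known)

    return score


def select_documents(
    docs: Sequence[str], score_fn: Callable[[str], float], keep_fraction: float = 0.25
) -> List[str]:
    """Keep the highest-scoring fraction of docs, preserving corpus order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(len(docs) * keep_fraction))
    ranked = sorted(range(len(docs)), key=lambda i: score_fn(docs[i]), reverse=True)
    return [docs[i] for i in sorted(ranked[:n_keep])]


docs = [
    "buy cheap pills now click here",
    "the theorem follows from the lemma by induction",
    "click here to win a prize now",
    "we prove the lemma using a standard argument",
    "cheap prize click now",
    "the proof uses induction on the lemma",
    "win cheap pills here",
    "a standard theorem about induction",
]

# Grade only a small sample with the (expensive) grader, then score everything
# with the cheap distilled classifier and keep the top 25%.
classifier = distill_classifier(docs[:4], toy_grader)
kept = select_documents(docs, classifier, keep_fraction=0.25)
print(kept)
```

The design point the sketch tries to capture is that the grader only sees a small labeled sample, while the cheaper distilled scorer is what runs over the whole corpus.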
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Data Quality and Management
