Large Language Model-guided Document Selection

Xiang Kong; Tom Gunter; Ruoming Pang

arXiv:2406.04638·cs.CL·June 10, 2024

TL;DR

This paper introduces a scalable document selection method using LLMs as graders to filter large corpora, enabling high-quality model training with significantly reduced computational costs.

Contribution

It presents a novel approach that employs prompted LLMs to label and filter training data, reducing training data size by 75% while maintaining performance across benchmarks.

Findings

01. Filtering matches the quality of a model trained on the full corpus across diverse benchmarks while using at most 70% of the FLOPs.

02. More capable LLM labelers and classifier models improve results and make them less sensitive to the labeler's prompt.

03. In-context learning boosts the performance of less-capable labeling models.

Abstract

Large Language Model (LLM) pre-training exhausts an ever-growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al., 2023], we explore a promising direction for scalable general-domain document selection: employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is applied autonomously at scale to a large, and already heavily filtered, web-crawl-derived corpus. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple…
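The selection step described in the abstract (score each document with a quality classifier, then keep only the best-scoring quarter of the corpus) can be illustrated with a minimal sketch. This is not the authors' implementation: `select_top_fraction` and `toy_quality_score` are hypothetical names, and the toy scorer is a placeholder heuristic standing in for the classifier distilled from LLM-assigned labels.

```python
from typing import Callable, List, Sequence


def select_top_fraction(
    documents: Sequence[str],
    score_fn: Callable[[str], float],
    keep_fraction: float = 0.25,
) -> List[str]:
    """Keep the highest-scoring `keep_fraction` of documents, in corpus order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not documents:
        return []
    scores = [score_fn(doc) for doc in documents]
    n_keep = max(1, round(len(documents) * keep_fraction))
    # Rank indices by descending score; ties are broken by original position.
    ranked = sorted(range(len(documents)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:n_keep])  # restore original corpus order
    return [documents[i] for i in kept]


def toy_quality_score(doc: str) -> float:
    """Hypothetical stand-in for the distilled classifier.

    Returns the fraction of distinct whitespace-separated tokens, which
    penalizes repetitive text. A real pipeline would call a trained model.
    """
    words = doc.lower().split()
    return len(set(words)) / len(words) if words else 0.0


corpus = [
    "buy buy buy now now now",
    "Gradient descent updates parameters along the negative gradient.",
    "click here click here",
    "lorem lorem lorem lorem",
]

# Dropping 75% of the corpus corresponds to keep_fraction=0.25.
selected = select_top_fraction(corpus, toy_quality_score, keep_fraction=0.25)
print(selected)
```

Ranking by score and keeping a fixed fraction, rather than applying an absolute score threshold, guarantees the size of the retained corpus regardless of how the classifier's scores are calibrated. Whether the paper uses a rank-based or threshold-based cut is not stated in the text above, so treat that choice as an assumption of this sketch.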

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Data Quality and Management