ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws

Ruihang Li; Yixuan Wei; Miaosen Zhang; Nenghai Yu; Han Hu; Houwen Peng

arXiv:2408.08310·cs.CL·August 16, 2024

TL;DR

ScalingFilter is a novel data quality assessment method for large language models that uses perplexity differences between models to improve downstream performance without relying on reference datasets.

Contribution

It introduces a reference-free data filtering approach based on inverse scaling law utilization, enhancing model performance and diversity.

Findings

01. Improves zero-shot performance on downstream tasks.

02. Balances dataset quality and semantic diversity effectively.

03. Eliminates bias from reference datasets in data filtering.

Abstract

High-quality data is crucial for the pre-training performance of large language models. Unfortunately, existing quality filtering methods rely on a known high-quality dataset as a reference, which can introduce bias and compromise diversity. In this paper, we propose ScalingFilter, a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data, thereby eliminating the influence of the reference dataset in the filtering process. A theoretical analysis shows that ScalingFilter is equivalent to an inverse utilization of scaling laws. By training models with 1.3B parameters on the same data source processed by various quality filters, we find that ScalingFilter can improve the zero-shot performance of pre-trained models on downstream tasks. To assess the bias introduced by quality filtering, we introduce semantic…
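The core idea in the abstract, scoring a document by how differently two models of different sizes rate it, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the abstract only says "perplexity difference", so the log-perplexity gap used below, the direction of the ranking (a larger gap is treated as higher quality), the function names, and the toy log-probabilities are all hypothetical. In practice the per-token log-probabilities would come from a smaller and a larger language model trained on the same data.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one document from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def quality_score(small_logprobs: Sequence[float],
                  large_logprobs: Sequence[float]) -> float:
    """Assumed scoring form: log PPL(small model) - log PPL(large model).

    A positive value means the larger model predicts the document better
    than the smaller one does.
    """
    return math.log(perplexity(small_logprobs)) - math.log(perplexity(large_logprobs))


def keep_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Indices of the highest-scoring documents (at least one is kept)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: (small-model log-probs, large-model log-probs) per document.
docs = [
    ([math.log(0.25)] * 3, [math.log(0.5)] * 3),   # large model helps a lot
    ([math.log(0.5)] * 3, [math.log(0.5)] * 3),    # no gap between models
    ([math.log(0.2)] * 3, [math.log(0.25)] * 3),   # small gap
]
scores = [quality_score(small, large) for small, large in docs]
kept = keep_top_fraction(scores, fraction=0.34)
```

Note that no reference corpus appears anywhere in the sketch: the score depends only on the two models' outputs for the document itself, which is the property the abstract emphasizes.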

Taxonomy

Topics: Data Mining Algorithms and Applications · Data Quality and Management · Big Data and Business Intelligence