FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

Erik Henriksson; Otto Tarkka; Filip Ginter

arXiv:2501.07314·cs.CL·January 14, 2025

TL;DR

This paper presents an LLM-based line-level filtering method that improves data quality for training large language models, leading to better performance and efficiency, and releases a new annotated dataset for the community.

Contribution

Introduces a novel LLM-based line filtering approach and a labeled dataset to enhance training data quality for LLMs.

Findings

1. Filtered data improves GPT-2 model accuracy on HellaSwag.

2. Models trained on filtered data reach performance targets faster.

3. Filtering reduces data volume by up to 25% without performance loss.

Abstract

Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based…
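The line-level filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `filter_document`, `toy_score`, the marker list, and the 0.5 threshold are all hypothetical names and values chosen for illustration. In the paper the per-line score would come from the trained DeBERTa-v3 classifier; here a keyword heuristic stands in so the example runs without any model download.

```python
from typing import Callable, List


def filter_document(
    text: str,
    score_line: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Keep only the lines whose quality score meets the threshold.

    `score_line` is assumed to return a value in [0, 1], where higher
    means the line is more likely to be clean text. The threshold is an
    illustrative default, not a value taken from the paper.
    """
    kept: List[str] = [
        line for line in text.split("\n") if score_line(line) >= threshold
    ]
    return "\n".join(kept)


def toy_score(line: str) -> float:
    """Hypothetical stand-in for a trained line-quality classifier.

    Blank lines and lines containing common boilerplate phrases get a
    low score; everything else gets a high score.
    """
    if not line.strip():
        return 0.0
    lowered = line.lower()
    boilerplate_markers = (
        "cookie",
        "subscribe",
        "click here",
        "all rights reserved",
    )
    if any(marker in lowered for marker in boilerplate_markers):
        return 0.1
    return 0.9


if __name__ == "__main__":
    document = (
        "The river floods every spring.\n"
        "Click here to subscribe!\n"
        "Farmers plant after the water recedes."
    )
    print(filter_document(document, toy_score))
```

Passing the scorer in as a callable keeps the filtering logic independent of the model, so a real classifier could replace `toy_score` without changing `filter_document`. Raising the threshold removes more lines, which is one plausible way to trade data volume against quality.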

Code & Models

Repositories

turkunlp/finerweb-10bt
PyTorch · Official

Models

TurkuNLP/finerweb-quality-classifier
model · 28 downloads · 4 likes

Datasets

TurkuNLP/finerweb-10bt
dataset · 329 downloads

Taxonomy

Methods

Attention Is All You Need · Cosine Annealing · Adam · Residual Connection · Dropout · Linear Layer · Linear Warmup With Cosine Annealing · Weight Decay · Multi-Head Attention