Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

Yudong Wang; Zixuan Fu; Jie Cai; Peijun Tang; Hongya Lyu; Yewei Fang; Zhi Zheng; Jie Zhou; Guoyang Zeng; Chaojun Xiao; Xu Han; Zhiyuan Liu

arXiv:2505.05427·cs.CL·May 9, 2025

TL;DR

This paper introduces Ultra-FineWeb, an efficient data filtering pipeline that improves high-quality data selection for LLM training, leading to better model performance and reduced costs.

Contribution

It presents a novel, efficient verification strategy and a lightweight filtering pipeline that enhance data quality and training efficiency for large language models.

Findings

1. LLMs trained on Ultra-FineWeb outperform baselines on multiple benchmarks.

2. The filtering pipeline reduces experimental and inference costs.

3. High-quality data significantly boosts LLM performance.

Abstract

Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed…
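The idea of training a lightweight classifier on seed data and using it to filter a corpus can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it swaps in a hand-rolled token log-odds scorer for a fastText classifier so that it runs with the standard library alone, and the function names (`train_log_odds`, `quality_score`, `filter_corpus`), the toy seed sets, and the threshold of 0.0 are all assumptions made for the example. The verification strategy described in the abstract is not modeled here.

```python
import math
from collections import Counter


def train_log_odds(positive_docs, negative_docs, smoothing=1.0):
    """Learn per-token log-odds weights from seed documents.

    Hypothetical stand-in for a trained quality classifier: a positive
    weight means the token is relatively more frequent in positive seeds.
    """
    pos_counts = Counter(t for d in positive_docs for t in d.lower().split())
    neg_counts = Counter(t for d in negative_docs for t in d.lower().split())
    vocab = set(pos_counts) | set(neg_counts)
    # Additive smoothing keeps every probability strictly positive.
    pos_total = sum(pos_counts.values()) + smoothing * len(vocab)
    neg_total = sum(neg_counts.values()) + smoothing * len(vocab)
    return {
        tok: math.log((pos_counts[tok] + smoothing) / pos_total)
        - math.log((neg_counts[tok] + smoothing) / neg_total)
        for tok in vocab
    }


def quality_score(doc, weights):
    """Mean log-odds over tokens seen in training; 0.0 if none are known."""
    known = [weights[t] for t in doc.lower().split() if t in weights]
    return sum(known) / len(known) if known else 0.0


def filter_corpus(docs, weights, threshold=0.0):
    """Keep documents whose score is strictly above the threshold."""
    return [d for d in docs if quality_score(d, weights) > threshold]


# Toy seed sets, invented purely for illustration.
seed_positive = [
    "the theorem follows from the lemma",
    "we prove the result by induction",
]
seed_negative = [
    "click here to win free prizes",
    "buy cheap pills online now",
]
corpus = ["the proof follows by induction", "win free pills now"]

weights = train_log_odds(seed_positive, seed_negative)
kept = filter_corpus(corpus, weights)
```

In this sketch the choice of seed sets drives everything downstream, which is the sensitivity the abstract raises as its second challenge. A real pipeline would replace the scorer with a trained classifier and tune the threshold against held-out data.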

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification

Methods: fastText