Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, Xu Han, and Zhiyuan Liu

TL;DR
This paper introduces Ultra-FineWeb, an efficient data filtering pipeline that improves high-quality data selection for LLM training, leading to better model performance and reduced costs.
Contribution
It presents a novel, efficient verification strategy and a lightweight filtering pipeline that enhance data quality and training efficiency for large language models.
Findings
LLMs trained on Ultra-FineWeb outperform baselines on multiple benchmarks.
The filtering pipeline reduces experimental and inference costs.
High-quality data significantly boosts LLM performance.
Abstract
Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed…
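To make the idea of model-driven filtering concrete, here is a minimal sketch of a classifier-style quality filter: a scorer is fitted on positive and negative seed documents, and only documents scoring above a threshold are kept. This is an illustration under stated assumptions, not the authors' pipeline. It swaps in a simple smoothed token log-odds scorer for a real lightweight classifier such as fastText, and every name in it (train_scorer, quality_score, filter_corpus, the 0.5 threshold, the toy seed texts) is hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately crude assumption)."""
    return text.lower().split()


def train_scorer(positive_docs, negative_docs):
    """Fit per-token log-odds weights with add-one smoothing.

    Positive weights mark tokens that are more frequent in the positive seeds.
    """
    pos = Counter(t for d in positive_docs for t in tokenize(d))
    neg = Counter(t for d in negative_docs for t in tokenize(d))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        t: math.log((pos[t] + 1) / pos_total) - math.log((neg[t] + 1) / neg_total)
        for t in vocab
    }


def quality_score(weights, text):
    """Squash the mean token log-odds into (0, 1); unseen tokens contribute 0."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # empty documents are treated as lowest quality
    mean = sum(weights.get(t, 0.0) for t in tokens) / len(tokens)
    return 1.0 / (1.0 + math.exp(-mean))


def filter_corpus(weights, docs, threshold=0.5):
    """Keep only documents whose score exceeds the (hypothetical) threshold."""
    return [d for d in docs if quality_score(weights, d) > threshold]


# Toy seed data, invented purely for illustration.
positive_seeds = ["the theorem follows from the lemma", "we prove the result by induction"]
negative_seeds = ["click here to win free prizes", "buy cheap pills now click here"]
candidates = ["we prove the lemma by induction", "click here to buy free pills"]

weights = train_scorer(positive_seeds, negative_seeds)
kept = filter_corpus(weights, candidates)
```

In this sketch the choice of seed documents entirely determines what counts as "high quality", which is the subjectivity the abstract identifies as the second challenge.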
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification
Methods: fastText
