FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
Liangyu Xu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Rongxiang Weng, Jingang Wang, Xunliang Cai

TL;DR
FIRE is a scalable framework that combines multiple data quality signals to improve data selection for pretraining large language models, leading to better performance with less data.
Contribution
We introduce FIRE, a novel method that integrates diverse quality ratings into a unified assessment, enhancing data selection for pretraining LLMs.
Findings
FIRE outperforms existing data selection methods.
FIRE achieves comparable performance with less than 37.5% of training data.
FIRE significantly boosts downstream task performance.
Abstract
Selecting high-quality data can improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques or single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Extensive experiments show that FIRE outperforms other data selection methods and significantly boosts pretrained model performance across a wide…
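The abstract gives no formulas for how the ratings are aligned or combined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes rank normalization as the alignment step, a weighted average as the integration step, and a fixed keep fraction per round as the progressive selection scheme. The function names (`rank_normalize`, `integrate_ratings`, `progressive_select`) and all parameters are hypothetical.

```python
from typing import Dict, List, Sequence


def rank_normalize(scores: Sequence[float]) -> List[float]:
    """Map raw scores to [0, 1] by rank so raters on different scales become comparable.

    Assumption: rank normalization stands in for the paper's alignment step.
    Ties are broken by original position (sorted() is stable).
    """
    n = len(scores)
    if n <= 1:
        return [1.0] * n
    order = sorted(range(n), key=lambda i: scores[i])
    normalized = [0.0] * n
    for rank, idx in enumerate(order):
        normalized[idx] = rank / (n - 1)
    return normalized


def integrate_ratings(
    ratings: Dict[str, Sequence[float]], weights: Dict[str, float]
) -> List[float]:
    """Weighted average of rank-normalized scores (assumed integration rule)."""
    aligned = {name: rank_normalize(s) for name, s in ratings.items()}
    total = sum(weights[name] for name in ratings)
    n = len(next(iter(ratings.values())))
    return [
        sum(weights[name] * aligned[name][i] for name in ratings) / total
        for i in range(n)
    ]


def progressive_select(
    ratings: Dict[str, Sequence[float]],
    weights: Dict[str, float],
    keep_fraction: float,
    rounds: int,
) -> List[int]:
    """Shrink the pool over several rounds, re-aligning scores among survivors.

    Returns the original indices of the data points that remain.
    """
    pool = list(range(len(next(iter(ratings.values())))))
    for _ in range(rounds):
        # Re-score only the surviving points, so ranks are relative to the current pool.
        sub = {name: [s[i] for i in pool] for name, s in ratings.items()}
        combined = integrate_ratings(sub, weights)
        keep = max(1, int(len(pool) * keep_fraction))
        ranked = sorted(range(len(pool)), key=lambda j: combined[j], reverse=True)
        pool = sorted(pool[j] for j in ranked[:keep])
    return pool


# Two hypothetical raters scoring four documents on different scales.
ratings = {"rater_a": [0.1, 0.9, 0.5, 0.3], "rater_b": [10, 90, 80, 20]}
weights = {"rater_a": 1.0, "rater_b": 1.0}
selected = progressive_select(ratings, weights, keep_fraction=0.5, rounds=2)
```

Re-normalizing inside each round means later rounds discriminate among points that already scored well, which is one plausible reading of "iteratively refines"; the paper itself may do this differently.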
Taxonomy
Topics: Data Quality and Management · Medical Coding and Health Information
