FLUX: Data Worth Training On
Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya

TL;DR
FLUX is a novel data preprocessing pipeline that maximizes token retention and quality, enabling more efficient training of large language models with improved performance and reduced compute requirements.
Contribution
FLUX introduces a new preprocessing method that balances high data quality with large-scale token retention, surpassing prior approaches in efficiency and model performance.
Findings
Models trained on FLUX-curated data outperform models trained on data from prior state-of-the-art pipelines such as DCLM and FineWeb.
FLUX reduces training compute by 34.4% while maintaining or improving accuracy.
FLUX extracts more usable tokens from data sources than existing methods.
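To make the compute figure above concrete, the short Python sketch below converts a fractional reduction into the remaining compute share and the implied baseline-to-FLUX ratio. The helper name and the normalized baseline of 1.0 are illustrative assumptions; this summary does not state which baseline the 34.4% is measured against.

```python
def compute_after_reduction(baseline: float, reduction: float) -> float:
    """Return the compute left after cutting `reduction` (a fraction in [0, 1)) from `baseline`."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return baseline * (1.0 - reduction)


REPORTED_REDUCTION = 0.344  # 34.4%, as stated in the findings

# Baseline compute is normalized to 1.0 (an assumption for illustration only).
remaining = compute_after_reduction(1.0, REPORTED_REDUCTION)
baseline_ratio = 1.0 / remaining  # how many times more compute the baseline needs

print(f"remaining share: {remaining:.3f}")
print(f"baseline / FLUX: {baseline_ratio:.2f}x")
```

Read this way, a 34.4% reduction means the comparison run would need roughly one and a half times the compute of the FLUX run to reach the same point, under whatever matching criterion the authors use.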
Abstract
Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice one for the other: either aggressively filtering to improve quality at the cost of severe token loss, or retaining large volumes of data while introducing substantial noise. In this work, we introduce FLUX, a preprocessing pipeline specifically designed to break this long-standing trade-off by maximizing token retention while enforcing rigorous quality control. Models trained on FLUX-curated data consistently outperform prior methods. A 3B-parameter model trained on 60B tokens with FLUX achieves 32.14% MMLU accuracy, surpassing the previous state-of-the-art pipeline DCLM (31.98%) and significantly outperforming FineWeb (29.88%). FLUX achieves the…
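The MMLU figures quoted in the abstract can be compared with a little arithmetic. The sketch below is only a reading aid: the dictionary holds the three accuracies reported above, and the function name `margins_over_baselines` is a hypothetical helper, not part of FLUX.

```python
# MMLU accuracy (%) reported in the abstract for a 3B-parameter model
# trained on 60B tokens from each pipeline.
MMLU_ACCURACY = {"FLUX": 32.14, "DCLM": 31.98, "FineWeb": 29.88}


def margins_over_baselines(scores: dict, target: str = "FLUX") -> dict:
    """Return the target's lead over every other entry, in percentage points."""
    target_score = scores[target]
    return {
        name: round(target_score - score, 2)
        for name, score in scores.items()
        if name != target
    }


margins = margins_over_baselines(MMLU_ACCURACY)
for name, gap in margins.items():
    print(f"FLUX vs {name}: +{gap:.2f} percentage points")
```

The gap to DCLM is 0.16 percentage points and the gap to FineWeb is 2.26 percentage points, which matches the abstract's wording of "surpassing" the former and "significantly outperforming" the latter.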
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
