RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

Baolong Bi; Shenghua Liu; Xingzhang Ren; Dayiheng Liu; Junyang Lin; Yiwei Wang; Lingrui Mei; Junfeng Fang; Jiafeng Guo; Xueqi Cheng

arXiv:2507.03253·cs.CL·July 10, 2025

TL;DR

RefineX introduces a scalable, expert-guided programmatic approach for fine-grained pre-training data refinement in large language models, improving quality and downstream performance efficiently.

Contribution

It presents a novel, high-precision data refinement framework that distills expert-guided edits into minimal programs, enabling scalable and effective data enhancement for LLM pre-training.

Findings

01. Consistently outperforms raw and filtered data across multiple tasks.
02. Achieves 2.6%-7.2% gains on 750M models.
03. Requires fewer training tokens while maintaining performance.

Abstract

The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose RefineX, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into…
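The reviews on this page describe the distillation step as comparing the raw text with an end-to-end refined version and keeping only the deletions. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation: the function names `extract_deletions` and `apply_deletions` are hypothetical, the comparison is word-level, and Python's `difflib.SequenceMatcher` stands in for the minimum-edit-distance heuristic (it is a longest-matching-block aligner, not a true minimum-edit-distance algorithm).

```python
from difflib import SequenceMatcher


def extract_deletions(raw: str, refined: str) -> list[tuple[int, int]]:
    """Return word-index spans (start, end) of `raw` that `refined` drops.

    Hypothetical helper: only pure deletions are kept. Insertions and
    replacements made by the refiner are ignored, mirroring the
    deletion-only constraint described in the reviews.
    """
    raw_words, refined_words = raw.split(), refined.split()
    matcher = SequenceMatcher(a=raw_words, b=refined_words, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            spans.append((i1, i2))
    return spans


def apply_deletions(raw: str, spans: list[tuple[int, int]]) -> str:
    """Execute the deletion 'program' on the raw text (word-level)."""
    words = raw.split()
    drop = {i for start, end in spans for i in range(start, end)}
    return " ".join(w for i, w in enumerate(words) if i not in drop)


if __name__ == "__main__":
    raw = "Click here to subscribe! The model is trained on 20B tokens."
    refined = "The model is trained on 20B tokens."
    program = extract_deletions(raw, refined)
    print(program)
    print(apply_deletions(raw, program))
```

Because replacements are discarded, a refiner that rewrites a word (for example, correcting a typo) leaves the raw text unchanged in this sketch: no new text is introduced, but such corrections are also lost.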

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The idea is simple and natural, also leads to empirical gains.
- The experiments consider a bunch of filtering methods to demonstrate consistency which is nice to have.
- Analysis is given for the two metrics they care about (efficiency and reliability).

Weaknesses

- The improvements are consistent but somewhat marginal.
- RefineX has the overhead of sampling text2 ~ P(.|text1) compared to ProX, which compounds the issue of whether this is worth the trouble in huge scales.
- This is not necessarily against the paper given limited resources, but the considered scales (<1b models, 20b tokens) seem potentially too small to draw strong conclusions. There's a possibility that the small gains here may wash out further with larger models and refined data sizes,

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. Employ a minimum edit distance heuristic to transform end-to-end refined text into several refinement programs, obtaining high-quality SFT data;
2. Introducing the DataMan quality scorer to analyse refined texts in depth.

Weaknesses

The experiments conducted in this paper are solid. However, I have a major concern regarding the fairness of the comparison to ProX-C. There are two primary sources of variance: (1) ProX-C is fine-tuned from a 0.3B from-scratch pre-trained language model trained on approximately 20B tokens, whereas RefineX is fine-tuned from Qwen-0.6B, an over-trained state-of-the-art language model; and (2) the teacher models used for synthesizing SFT data differ, with ProX-C relying on Llama-70B and RefineX us

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The two-stage "refine-then-distill" approach is a clever solution to create reliable supervision program for data cleaning.
* Pre-training models from scratch against a wide range of strong baselines provides compelling evidence of the method's effectiveness.

Weaknesses

* The deletion-only constraint, while ensuring reliability, prevents the model from making other potentially valuable corrections like fixing typos or factual errors.
* The paper would be more sound if more analysis and fair comparison are provided 1) against a distilled small model directly do text refinement. 2) against a LLM-based quality filter with similar inference costs.
* How to build the RefineX model and how to define the evaluation metrics is the key contribution to this paper. Howeve

Reviewer 04 · Rating 2 · Confidence 5

Strengths

1. It’s fast and trustworthy because they rewrite with a big model, then keep only the delete edits, so the final program is tiny, quick to run, and doesn’t add biased or made-up text.
2. It delivers better results with less data, beating raw/rule-based/ProX baselines at 350M/750M and often matching or topping them with fewer training tokens by cutting fluff and boosting useful signal.

Weaknesses

1. Most building blocks mirror ProX’s program-based refinement (program generation/execution paradigm, chunking, minimal API). The main change is how doc→program supervision is obtained (E2E first, then deletion-only extraction), which is an incremental tweak rather than a substantive algorithmic innovation.
2. The paper does not clarify whether ProX-C/ProX-D were re-trained following Zhou et al. or taken from released models/processed corpora. If the latter, the comparison is weak: RefineX and

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education