Better Datasets Start From RefineLab: Automatic Optimization for High-Quality Dataset Refinement

Xiaonan Luo; Yue Huang; Ping He; Xiangliang Zhang

arXiv:2511.06530 · cs.CL · November 11, 2025

TL;DR

RefineLab is an LLM-driven framework that automatically refines QA datasets to improve quality attributes like coverage and factual accuracy within a token budget, enhancing dataset reliability for LLM evaluation.

Contribution

This work introduces RefineLab, the first framework for automatic, controllable QA dataset refinement using LLMs under resource constraints.

Findings

01. RefineLab reduces divergence from expert datasets across multiple quality metrics.

02. It effectively balances quality improvements with token budget limitations.

03. The framework demonstrates broad applicability for scalable dataset enhancement.

Abstract

High-quality Question-Answer (QA) datasets are foundational for reliable Large Language Model (LLM) evaluation, yet even expert-crafted datasets exhibit persistent gaps in domain coverage, misaligned difficulty distributions, and factual inconsistencies. The recent surge in generative model-powered datasets has compounded these quality challenges. In this work, we introduce RefineLab, the first LLM-driven framework that automatically refines raw QA textual data into high-quality datasets under a controllable token-budget constraint. RefineLab takes a set of target quality attributes (such as coverage and difficulty balance) as refinement objectives, and performs selective edits within a predefined token budget to ensure practicality and efficiency. In essence, RefineLab addresses a constrained optimization problem: improving the quality of QA samples as much as possible while respecting…
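The budget-constrained framing in the abstract (maximize quality improvement while staying within a token budget) can be illustrated with a small sketch. This is not RefineLab's actual algorithm, which the text above does not specify; it is a generic greedy knapsack heuristic that ranks candidate edits by estimated gain per token. The names `CandidateEdit`, `select_edits`, `token_cost`, and `quality_gain` are hypothetical, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateEdit:
    """A hypothetical refinement edit for one QA sample (illustrative only)."""
    sample_id: str
    token_cost: int      # assumed: tokens an LLM would spend applying the edit
    quality_gain: float  # assumed: estimated improvement in a quality metric


def select_edits(candidates, token_budget):
    """Greedily pick edits by gain per token without exceeding the budget.

    This is a standard knapsack heuristic, not the method from the paper.
    Returns the chosen edits and the total tokens they consume.
    """
    # Ignore edits that cost nothing meaningful or bring no estimated gain.
    useful = [c for c in candidates if c.token_cost > 0 and c.quality_gain > 0]
    # Highest estimated gain per token first.
    ranked = sorted(useful, key=lambda c: c.quality_gain / c.token_cost, reverse=True)

    chosen, spent = [], 0
    for edit in ranked:
        if spent + edit.token_cost <= token_budget:
            chosen.append(edit)
            spent += edit.token_cost
    return chosen, spent


# Illustrative inputs; the costs and gains are invented.
candidates = [
    CandidateEdit("q1", token_cost=100, quality_gain=0.5),
    CandidateEdit("q2", token_cost=50, quality_gain=0.4),
    CandidateEdit("q3", token_cost=200, quality_gain=0.6),
    CandidateEdit("q4", token_cost=30, quality_gain=0.0),
]
chosen, spent = select_edits(candidates, token_budget=160)
print([e.sample_id for e in chosen], spent)
```

A greedy ratio rule is only an approximation to the underlying knapsack problem, but it makes the trade-off concrete: an edit with a large gain (here `q3`) can still be skipped when its token cost does not fit the remaining budget.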

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods