Better Datasets Start From RefineLab: Automatic Optimization for High-Quality Dataset Refinement
Xiaonan Luo, Yue Huang, Ping He, Xiangliang Zhang

TL;DR
RefineLab is an LLM-driven framework that automatically refines QA datasets to improve quality attributes like coverage and factual accuracy within a token budget, enhancing dataset reliability for LLM evaluation.
Contribution
This work introduces RefineLab, the first framework for automatic, controllable QA dataset refinement using LLMs under resource constraints.
Findings
RefineLab reduces divergence from expert datasets across multiple quality metrics.
It effectively balances quality improvements with token budget limitations.
The framework demonstrates broad applicability for scalable dataset enhancement.
Abstract
High-quality Question-Answer (QA) datasets are foundational for reliable Large Language Model (LLM) evaluation, yet even expert-crafted datasets exhibit persistent gaps in domain coverage, misaligned difficulty distributions, and factual inconsistencies. The recent surge in generative model-powered datasets has compounded these quality challenges. In this work, we introduce RefineLab, the first LLM-driven framework that automatically refines raw QA textual data into high-quality datasets under a controllable token-budget constraint. RefineLab takes a set of target quality attributes (such as coverage and difficulty balance) as refinement objectives, and performs selective edits within a predefined token budget to ensure practicality and efficiency. In essence, RefineLab addresses a constrained optimization problem: improving the quality of QA samples as much as possible while respecting…
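To make the budgeted-refinement idea concrete, here is a minimal sketch of choosing which edits to apply under a token budget. It is not RefineLab's actual algorithm, which the abstract does not specify. The `CandidateEdit` fields, the `select_edits` helper, and the greedy gain-per-token rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateEdit:
    """A hypothetical edit to one QA sample (fields are illustrative assumptions)."""
    sample_id: str
    quality_gain: float  # assumed estimate of how much the edit improves quality
    token_cost: int      # assumed number of LLM tokens the edit would consume


def select_edits(candidates, token_budget):
    """Greedily pick edits with the best gain per token that still fit the budget.

    This is a standard knapsack-style heuristic, not an exact optimizer.
    """
    useful = [c for c in candidates if c.token_cost > 0 and c.quality_gain > 0]
    ranked = sorted(useful, key=lambda c: c.quality_gain / c.token_cost, reverse=True)

    chosen, spent = [], 0
    for edit in ranked:
        # Skip edits that would exceed the budget; cheaper ones may still fit.
        if spent + edit.token_cost <= token_budget:
            chosen.append(edit)
            spent += edit.token_cost
    return chosen, spent


# Toy candidates with made-up gains and costs.
candidates = [
    CandidateEdit("q1", 0.30, 120),
    CandidateEdit("q2", 0.10, 20),
    CandidateEdit("q3", 0.25, 200),
    CandidateEdit("q4", 0.05, 90),
]
chosen, spent = select_edits(candidates, token_budget=250)
print([e.sample_id for e in chosen], spent)
```

With a budget of 250 tokens, the sketch skips the expensive `q3` edit and spends the remainder on cheaper ones. A greedy ratio rule is only one way to approach such a constrained problem and can miss the optimal selection.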
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
