Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

Xin Gao; Xiaoyang Wang; Yun Zhu; Mengzhang Cai; Conghui He; Lijun Wu

arXiv:2601.09733 · cs.CL · January 16, 2026

Open Access · 6 Models · 5 Datasets

TL;DR

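This paper introduces a systematic, feedback-driven framework for constructing Supervised Fine-Tuning (SFT) datasets for large language models. Built on OpenDataArena (ODA), it turns dataset value benchmarking into feedback that guides dataset construction, yielding state-of-the-art results on math benchmarks and improved data efficiency.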

Contribution

It presents a closed-loop dataset engineering paradigm that uses value-anchored rankings and multi-dimensional analysis, and instantiates it in two new datasets, ODA-Math-460k and ODA-Mixture, which improve performance over existing baselines.

Findings

01. State-of-the-art results on math benchmarks with ODA-Math-460k

02. Superior multi-domain instruction datasets outperform larger baselines

03. Enhanced data efficiency and model reasoning capabilities

Abstract

The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k &…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification