Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets
Xin Gao, Xiaoyang Wang, Yun Zhu, Mengzhang Cai, Conghui He, Lijun Wu

TL;DR
This paper introduces a systematic, feedback-driven framework called OpenDataArena for constructing superior training datasets for large language models, leading to state-of-the-art results and improved data efficiency.
Contribution
It presents a novel closed-loop dataset engineering paradigm using value-based rankings and multi-dimensional analysis, with new datasets demonstrating significant performance improvements.
Findings
State-of-the-art results on math benchmarks with ODA-Math-460k
Superior multi-domain instruction datasets outperform larger baselines
Enhanced data efficiency and model reasoning capabilities
Abstract
The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k &…
Code & Models
- 🤗 OpenDataArena/Qwen2.5-7B-ODA-Mixture-500k · model · 9 dl · ♡ 2
- 🤗 OpenDataArena/Qwen2.5-7B-ODA-Mixture-100k · model · 1 dl
- 🤗 OpenDataArena/Qwen2.5-7B-ODA-Math-460k · model · 1 dl
- 🤗 OpenDataArena/Qwen3-8B-ODA-Math-460k · model · 6 dl · ♡ 1
- 🤗 OpenDataArena/Qwen3-8B-ODA-Mixture-100k · model · 7 dl · ♡ 1
- 🤗 OpenDataArena/Qwen3-8B-ODA-Mixture-500k · model · 6 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
