OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu

TL;DR
OpenDataArena (ODA) is an open platform that systematically benchmarks the intrinsic value of post-training datasets for Large Language Models, promoting reproducibility and data-centric AI research.
Contribution
The paper introduces ODA, a comprehensive ecosystem with tools and frameworks for fair data evaluation, lineage analysis, and benchmarking across diverse models and datasets.
Findings
Identifies trade-offs between data complexity and task performance.
Discovers redundancy in popular datasets through lineage analysis.
Maps genealogical relationships across datasets.
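One way to picture the lineage findings above is as a directed graph in which each dataset points to the datasets it was derived from. The following is a minimal sketch under that assumption; it is not ODA's lineage tool. The dataset names, the `LINEAGE` mapping, and the helpers `ancestors` and `shared_sources` are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical lineage map (not real ODA data): each dataset lists the
# datasets it was directly derived from.
LINEAGE = {
    "math_seed": [],
    "code_seed": [],
    "chat_seed": [],
    "mix_a": ["math_seed", "code_seed"],
    "mix_b": ["math_seed", "chat_seed"],
    "mix_c": ["mix_a", "mix_b"],
}


def ancestors(dataset, lineage):
    """Return every upstream dataset reachable from `dataset`."""
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(lineage.get(parent, []))
    return seen


def shared_sources(a, b, lineage):
    """Upstream datasets that both `a` and `b` draw on (a redundancy hint)."""
    return ancestors(a, lineage) & ancestors(b, lineage)


if __name__ == "__main__":
    print(sorted(ancestors("mix_c", LINEAGE)))
    print(sorted(shared_sources("mix_a", "mix_b", LINEAGE)))
```

In this toy graph, two mixtures that share an upstream source would be flagged as potentially overlapping. A real analysis would also need sample-level deduplication, which this sketch does not attempt.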
Abstract
The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box--characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of…
Code & Models
- 🤗 OpenDataArena/Qwen2.5-7B-ODA-Mixture-500k (model · 9 dl · ♡ 2)
- 🤗 OpenDataArena/Qwen2.5-7B-ODA-Mixture-100k (model · 1 dl)
- 🤗 OpenDataArena/Qwen2.5-7B-ODA-Math-460k (model · 1 dl)
- 🤗 OpenDataArena/Qwen3-8B-ODA-Math-460k (model · 6 dl · ♡ 1)
- 🤗 OpenDataArena/Qwen3-8B-ODA-Mixture-100k (model · 7 dl · ♡ 1)
- 🤗 OpenDataArena/Qwen3-8B-ODA-Mixture-500k (model · 6 dl)
Taxonomy
Topics: Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education · Language and cultural evolution
