D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery
Hanane Nour Moussa, Yifei Li, Zhuoyang Li, Yankai Yang, Cheng Tang, Tianshu Zhang, Nesreen K. Ahmed, Ali Payani, Ziru Chen, Huan Sun

TL;DR
D3-Gym is a new dataset of 565 real-world scientific tasks with verifiable environments, designed to advance data-driven discovery by providing executable environments, reference solutions, and evaluation scripts.
Contribution
D3-Gym is the first automatically constructed dataset with verifiable environments for scientific data-driven discovery, and the paper evaluates the quality of its verification signals.
Findings
The automatically synthesized evaluation scripts achieve 87.5% agreement with human-annotated gold standards.
Training on D3-Gym improves model performance on ScienceAgentBench.
Training on D3-Gym significantly narrows the gap with proprietary models.
Abstract
Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and…
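To make the task composition in the abstract concrete, here is a minimal Python sketch of what one task record and an agreement check might look like. The class name `Task`, its field names, and the helper `agreement_rate` are all hypothetical and are not taken from the D3-Gym release; the sketch also assumes each evaluation script and each human annotation reduces to a single pass/fail verdict, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Task:
    """Hypothetical record mirroring the per-task components named in the abstract."""
    instruction: str          # natural language instruction
    environment: str          # identifier of an executable environment (assumed to be a string)
    dataset_preview: str      # preview of the input dataset and artifacts
    reference_solution: str   # reference code solution
    evaluation_script: str    # automatically synthesized evaluation script


def agreement_rate(script_verdicts: Sequence[bool],
                   human_verdicts: Sequence[bool]) -> float:
    """Fraction of cases where the script verdict equals the human verdict.

    Assumes one boolean pass/fail verdict per case (an illustrative simplification).
    """
    if len(script_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not script_verdicts:
        raise ValueError("at least one verdict pair is required")
    matches = sum(s == h for s, h in zip(script_verdicts, human_verdicts))
    return matches / len(script_verdicts)


# Toy verdicts invented for illustration only, not data from the paper.
toy_script = [True, True, False, True, False, True, True, False]
toy_human = [True, True, False, True, False, True, True, True]
toy_rate = agreement_rate(toy_script, toy_human)
```

With these eight invented verdict pairs, seven of which match, `toy_rate` is 0.875. That is only a worked instance of the arithmetic behind an agreement percentage, not a reproduction of how the paper measured its reported figure.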
