DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou

TL;DR
DSGym is a comprehensive, extensible framework designed to evaluate and train data science agents across diverse tasks, addressing limitations of existing benchmarks by providing standardized, grounded, and modular evaluation environments.
Contribution
The paper introduces DSGym, a modular, live testbed for data science agents with curated task suites and training capabilities, improving evaluation rigor and task coverage.
Findings
A curated task suite standardizes and refines existing benchmarks.
A trained 4B model outperforms GPT-4o on analysis benchmarks.
DSGym enables end-to-end evaluation of planning, implementation, and validation.
Abstract
Data science agents promise to accelerate discovery and insight generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short for three reasons: fragmented evaluation interfaces that make cross-benchmark comparison difficult, narrow task coverage, and a lack of rigorous data grounding. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and…
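The data-grounding concern in the abstract (tasks that can be solved without the actual data) can be illustrated with a small check: run a solver with the data withheld and flag every task it still answers correctly. The sketch below is a minimal illustration under stated assumptions, not DSGym's implementation. The names `Task`, `Solver`, `find_shortcut_tasks`, and `shortcut_rate` are hypothetical, and exact string matching stands in for whatever answer comparison a real benchmark would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; not DSGym's actual schema."""
    task_id: str
    question: str
    data_path: Optional[str]
    answer: str


# A solver takes a question and an optional data path and returns an answer.
# Passing None as the data path simulates withholding the data.
Solver = Callable[[str, Optional[str]], str]


def is_correct(prediction: str, gold: str) -> bool:
    # Assumption: case-insensitive exact match after trimming whitespace.
    return prediction.strip().lower() == gold.strip().lower()


def find_shortcut_tasks(tasks: List[Task], solver: Solver) -> List[str]:
    """Return ids of tasks answered correctly with the data withheld."""
    flagged = []
    for task in tasks:
        prediction = solver(task.question, None)  # data deliberately omitted
        if is_correct(prediction, task.answer):
            flagged.append(task.task_id)
    return flagged


def shortcut_rate(tasks: List[Task], solver: Solver) -> float:
    """Fraction of tasks solvable without data; 0.0 for an empty suite."""
    if not tasks:
        return 0.0
    return len(find_shortcut_tasks(tasks, solver)) / len(tasks)
```

A task flagged this way is a candidate for removal or rewriting, since a correct answer there says little about whether an agent actually analyzed the data. In practice the solver would be a language model queried several times, so a flag is evidence rather than proof.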
Taxonomy
Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Cell Image Analysis Techniques
