AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists
Yifei Li, Hanane Nour Moussa, Ziru Chen, Shijie Chen, Botao Yu, Mingyi Xue, Benjamin Burns, Tzu-Yao Chiu, Vishal Dey, Zitong Lu, Chen Wei, Qianheng Zhang, Tianyu Zhang, Song Gao, Xuhui Huang, Xia Ning, Nesreen K. Ahmed, Ali Payani, Huan Sun

TL;DR
AutoSDT is an automated pipeline that uses LLMs to build a large, high-quality dataset of data-driven scientific discovery tasks. The resulting AutoSDT-Coder models outperform their baselines on related benchmarks.
Contribution
The paper presents AutoSDT, a novel automatic data collection pipeline that builds the largest open dataset for data-driven scientific discovery and enhances LLM performance on related tasks.
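As the abstract describes it, the pipeline has three stages: search for sources, select ecologically valid tasks, and synthesize task instructions and code solutions. The sketch below is only an illustration of that control flow, not the authors' implementation. The `Task` fields, `run_pipeline`, and the `toy_*` stand-ins are hypothetical names, and the LLM-backed steps are replaced by plain Python callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    # All field names are hypothetical, chosen for illustration only.
    source: str       # where the candidate program was found
    instruction: str  # natural-language task instruction
    solution: str     # code solution paired with the instruction


def run_pipeline(
    search: Callable[[], Iterable[str]],
    is_valid: Callable[[str], bool],
    synthesize: Callable[[str], Task],
) -> List[Task]:
    """Chain the three stages: search, then select, then synthesize."""
    tasks: List[Task] = []
    for candidate in search():
        # Selection: drop candidates that do not look like real discovery tasks.
        if not is_valid(candidate):
            continue
        # Synthesis: pair the kept candidate with an instruction and a solution.
        tasks.append(synthesize(candidate))
    return tasks


# Toy stand-ins for the LLM-backed stages (assumptions, not the paper's logic).
def toy_search() -> List[str]:
    return ["repo_a/analysis.py", "repo_b/setup.py", "repo_c/plot_results.py"]


def toy_is_valid(path: str) -> bool:
    # Pretend that packaging scripts are not data-driven discovery tasks.
    return not path.endswith("setup.py")


def toy_synthesize(path: str) -> Task:
    return Task(
        source=path,
        instruction=f"Reproduce the analysis in {path}",
        solution="# placeholder solution",
    )


tasks = run_pipeline(toy_search, toy_is_valid, toy_synthesize)
```

Passing each stage in as a callable keeps the control flow separate from whatever model or heuristic implements a stage, so one stage can be swapped without touching the others.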
Findings
The AutoSDT-5K dataset contains 5,404 coding tasks spanning four scientific disciplines and 756 unique Python packages.
In expert feedback on a subset of the tasks, 93% of the collected tasks are ecologically valid and 92.2% of the programs are functionally correct.
AutoSDT-Coder models outperform baseline models and approach GPT-4o performance on key benchmarks.
Abstract
Despite long-standing efforts in accelerating scientific discovery with AI, building AI co-scientists remains challenging due to limited high-quality data for training and evaluation. To tackle this data scarcity issue, we present AutoSDT, an automatic pipeline that collects high-quality coding tasks in real-world data-driven discovery workflows. AutoSDT leverages the coding capabilities and parametric knowledge of LLMs to search for diverse sources, select ecologically valid tasks, and synthesize accurate task instructions and code solutions. Using our pipeline, we construct AutoSDT-5K, a dataset of 5,404 coding tasks for data-driven discovery that covers four scientific disciplines and 756 unique Python packages. To the best of our knowledge, AutoSDT-5K is the only automatically collected and the largest open dataset for data-driven scientific discovery. Expert feedback on a subset of…
Taxonomy
Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Machine Learning and Data Classification
Methods: Balanced Selection
