AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

An Luo; Jin Du; Xun Xian; Robert Specht; Fangqiao Tian; Ganghua Wang; Xuan Bi; Charles Fleming; Ashish Kundu; Jayanth Srinivasa; Mingyi Hong; Rui Zhang; Tianxi Li; Galin Jones; Jie Ding

arXiv:2603.19005·cs.LG·March 20, 2026

TL;DR

This paper introduces AgentDS, a benchmark and competition to evaluate AI and human-AI collaboration in domain-specific data science, revealing current AI limitations and the continued importance of human expertise.

Contribution

It presents a new benchmark and competition for assessing AI and human-AI collaboration in domain-specific data science tasks, with insights into AI performance and future directions.

Findings

01. AI agents struggle with domain-specific reasoning

02. Human-AI collaboration outperforms AI-only approaches

03. Current AI performance is near or below the median of human teams

Abstract

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling