Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents
Josefa Lia Stoisser, Marc Boubnovski Martell, Sidsel Boldsen, Kaspar Märtens, Robert Kitchen

TL;DR
This paper introduces Ambig-DS, a benchmark suite for evaluating how data-science agents handle task-framing ambiguity. It shows that ambiguity leads to silent failures and that agents need to recognize when a task is underspecified.
Contribution
Ambig-DS provides diagnostic suites for prediction-target and evaluation-objective ambiguity, highlighting silent failure modes and the impact of framing issues in data-science agents.
Findings
Ambiguity causes silent failures rather than execution errors.
Allowing one clarifying question recovers much of the performance loss.
Agents struggle to determine when to ask for clarification.
Abstract
As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized the task was underspecified. We introduce Ambig-DS, two diagnostic suites: one for prediction-target ambiguity (Ambig-DS-Target, 51 tasks built on DSBench, a tabular modeling benchmark) and one for evaluation-objective ambiguity (Ambig-DS-Objective, 61 tasks built on MLE-bench, a Kaggle-style ML competition benchmark), constructed so that scoring uses each source benchmark's original evaluator. For every task we pair the original, fully specified version with an ambiguous variant produced by controlled edits; a human-and-LLM verification…
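The paired design described in the abstract (each original task matched with an ambiguous variant, both scored by the same evaluator) suggests a simple per-task comparison. The following is a minimal sketch of that idea, not the authors' evaluation code: the class and function names, the assumption that higher scores are better, and the demo numbers are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PairedResult:
    """Scores for one task under both framings (hypothetical structure)."""
    task_id: str
    score_specified: float  # agent score on the original, fully specified task
    score_ambiguous: float  # agent score on the ambiguous variant, same evaluator


def mean_score_drop(results: list[PairedResult]) -> float:
    """Average per-task drop from the specified to the ambiguous variant.

    Assumes a higher-is-better metric; a lower-is-better metric would
    need the sign flipped.
    """
    return mean(r.score_specified - r.score_ambiguous for r in results)


# Placeholder numbers for illustration only; they are not results from the paper.
demo = [
    PairedResult("task_a", 0.5, 0.25),
    PairedResult("task_b", 0.75, 0.75),
]
drop = mean_score_drop(demo)
```

Because both variants run to completion and are scored by the same evaluator, a positive drop here would reflect a framing error rather than an execution error, which is the silent failure mode the benchmark targets.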
