DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan

TL;DR
DARE-bench is a comprehensive, ground-truth-based benchmark for evaluating and improving LLMs' ability to perform data science tasks, addressing key gaps in existing benchmarks.
Contribution
It introduces a standardized, process-aware benchmark with verifiable ground truth and large-scale training data for LLMs in data science tasks.
Findings
Even advanced models struggle with data science tasks.
Fine-tuning on DARE-bench data significantly improves model performance.
Ground-truth evaluation ensures objective and reproducible assessment.
Abstract
The fast-growing demand for using Large Language Models (LLMs) to tackle complex multi-step data science tasks creates an urgent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show…
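To make "verifiable ground truth" concrete, the following is a minimal sketch of judge-free scoring. It is not DARE-bench's actual harness: the function names, the numeric tolerance, and the use of plain accuracy as the modeling metric are all assumptions made for illustration. The only detail taken from this page is that instruction-following (IF) tasks are checked against reference outputs and ML modeling (MM) tasks against the original labels.

```python
import math
from typing import Sequence


def score_if_task(predicted: Sequence[float], reference: Sequence[float],
                  rel_tol: float = 1e-6) -> bool:
    """Hypothetical IF check: pass only if the output reproduces the reference."""
    # Assumption: outputs are numeric vectors compared element-wise within a tolerance.
    if len(predicted) != len(reference):
        return False
    return all(math.isclose(p, r, rel_tol=rel_tol)
               for p, r in zip(predicted, reference))


def score_mm_task(predicted: Sequence[str], labels: Sequence[str]) -> float:
    """Hypothetical MM score: accuracy against held-out original labels."""
    # Assumption: accuracy stands in for whatever metric the benchmark really uses.
    if len(predicted) != len(labels):
        raise ValueError("prediction and label counts differ")
    if not labels:
        raise ValueError("no labels to score against")
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)
```

Because both checks are deterministic functions of the submission and the stored ground truth, two runs over the same outputs always produce the same score, which is the reproducibility property the abstract contrasts with human- or model-based judging.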
Peer Reviews
Decision·ICLR 2026 Poster
DARE-BENCH has several strengths over previous work. Unlike counterparts that only assess final-answer accuracy, DARE-BENCH uniquely evaluates both ML modeling performance and instruction fidelity, filling the void of process-aware assessment. It also provides 6,300 Kaggle-derived tasks with verifiable ground truth (reference outputs for IF tasks, original labels for MM tasks). The training data seems to be valuable. In addition, its four-stage pipeline minimizes human effort, enabling lar
The task diversity is limited. It exclusively covers tabular data, lacking support for multimodal DS tasks (e.g., text-image fusion, speech-data analysis), restricting applicability to broader DS scenarios.
1. The paper is well written and easy to follow. The benchmark offers a comprehensive evaluation scope: it covers diverse DS tasks (including underrepresented time-series forecasting) and enforces real-world constraints (execution time, interaction turns), enhancing practical relevance. 2. DARE-BENCH serves both as an evaluation tool and a large-scale training resource, with proven effectiveness in improving LLM performance via SFT/RL.
1. Lack of Comparison with Specialized DS Agents. The paper evaluates general-purpose and code-centric LLMs but omits comparisons with specialized data science agents, which are explicitly designed for multi-step DS workflows. This gap makes it hard to contextualize DARE-bench’s utility. It is unclear whether the benchmark’s gains (via fine-tuning) can match or surpass the performance of purpose-built DS agents. 2. Provide more explanation of the Instruction Following (IF) and ML Modeling (
1. It addresses two key gaps in existing benchmarks: it enables verifiable, process-aware evaluation (relying on reference-code or dataset ground truth, no human/model judges) and provides 6,300 Kaggle-derived tasks as large-scale training data, ensuring objective, reproducible assessments. 2. Its task coverage is comprehensive, spanning classification, regression, and time-series forecasting, with two variants (instruction-following/ML modeling) probing core DS capabilities, outperforming peers (e
1. Tasks are almost exclusively tabular, excluding multimodal inputs (e.g., text-image combinations, code-diagram interactions) common in modern DS. 2. Generating large-scale executable trajectories (for training data) is costly, and rejection sampling strategies may introduce biases toward shorter trajectories.
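The reviewer's concern about rejection sampling favoring shorter trajectories can be illustrated with a small calculation. This sketch is not drawn from the paper. It assumes, purely for illustration, that every step of a trajectory succeeds independently with the same probability p, so an n-step trajectory survives filtering with probability p^n. The function name and the example numbers are hypothetical.

```python
from typing import Sequence, Tuple


def accepted_length_stats(lengths: Sequence[int],
                          per_step_success: float) -> Tuple[float, float]:
    """Return (mean proposed length, mean accepted length) under the toy model."""
    # Assumption: independent per-step success, so P(accept) = p ** n for n steps.
    weights = [per_step_success ** n for n in lengths]
    proposed_mean = sum(lengths) / len(lengths)
    # Accepted trajectories are the proposed ones reweighted by acceptance probability.
    accepted_mean = sum(n * w for n, w in zip(lengths, weights)) / sum(weights)
    return proposed_mean, accepted_mean


proposed, accepted = accepted_length_stats([2, 4, 8], per_step_success=0.9)
print(f"proposed mean length: {proposed:.2f}, accepted mean length: {accepted:.2f}")
```

Under this toy model the accepted set is always shorter on average than the proposed set whenever p < 1 and lengths vary, which is the direction of bias the review points out. Whether and how strongly it shows up in DARE-bench's training data depends on details not given here.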
Taxonomy
Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Machine Learning and Data Classification
