DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Fan Shu; Yite Wang; Ruofan Wu; Boyi Liu; Zhewei Yao; Yuxiong He; Feng Yan

arXiv:2602.24288·cs.AI·March 2, 2026

TL;DR

DARE-bench is a comprehensive, ground-truth-based benchmark for evaluating and improving LLMs' ability to perform data science tasks, addressing key gaps in existing benchmarks.

Contribution

It introduces a standardized, process-aware benchmark with verifiable ground truth and large-scale training data for LLMs in data science tasks.

Findings

01. Even advanced models struggle with data science tasks.
02. Fine-tuning on DARE-bench data significantly improves model performance.
03. Ground-truth evaluation ensures objective and reproducible assessment.

Abstract

The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show…
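To make the idea of verifiable ground truth concrete, here is a minimal sketch of judge-free scoring. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the function names (`accuracy`, `rmse`, `matches_reference`), the choice of metrics, and the numeric tolerance are all hypothetical. The split between comparing against held-out labels for modeling tasks and comparing against a reference output for instruction-following tasks follows the description given in the reviews.

```python
from math import isclose, sqrt


def _check_lengths(y_true, y_pred):
    # Both sequences must be non-empty and aligned row by row.
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")


def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the held-out labels (classification)."""
    _check_lengths(y_true, y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error against the held-out targets (regression)."""
    _check_lengths(y_true, y_pred)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def matches_reference(reference, submitted, abs_tol=1e-6):
    """True if a submitted numeric output reproduces the reference output.

    abs_tol is an assumed tolerance, chosen here only for illustration.
    """
    return len(reference) == len(submitted) and all(
        isclose(r, s, abs_tol=abs_tol) for r, s in zip(reference, submitted)
    )
```

Because each function is deterministic given the labels or reference output, two runs over the same submission return the same score, which is the property that human- or model-based judging cannot guarantee.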

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

DARE-BENCH has several strengths over previous work. Unlike counterparts that only assess final-answer accuracy, DARE-BENCH evaluates both ML modeling performance and instruction fidelity, filling the gap in process-aware assessment. It also provides 6,300 Kaggle-derived tasks with verifiable ground truth (reference outputs for IF tasks, original labels for MM tasks). The training data seems to be valuable. In addition, its four-stage pipeline minimizes human effort, enabling lar

Weaknesses

The task diversity is limited. It exclusively covers tabular data, lacking support for multimodal DS tasks (e.g., text-image fusion, speech-data analysis), restricting applicability to broader DS scenarios.

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1. The paper is well written and easy to follow. The benchmark has a comprehensive evaluation scope: it covers diverse DS tasks (including underrepresented time-series forecasting) and enforces real-world constraints (execution time, interaction turns), which enhances its practical relevance.
2. DARE-BENCH serves both as an evaluation tool and as a large-scale training resource, with proven effectiveness in improving LLM performance via SFT/RL.

Weaknesses

1. Lack of comparison with specialized DS agents. The paper evaluates general-purpose and code-centric LLMs but omits comparisons with specialized data science agents, which are explicitly designed for multi-step DS workflows. This gap makes it hard to contextualize DARE-bench’s utility. It is unclear whether the benchmark’s gains (via fine-tuning) can match or surpass the performance of purpose-built DS agents.
2. Provide more explanations about the Instruction Following (IF) and ML Modeling (

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. It addresses two key gaps in existing benchmarks: it enables verifiable, process-aware evaluation (relying on reference code or dataset ground truth, with no human or model judges) and provides 6,300 Kaggle-derived tasks as large-scale training data, ensuring objective, reproducible assessments.
2. Its task coverage is comprehensive: it covers classification, regression, and time-series forecasting, with two variants (instruction-following/ML modeling) probing core DS capabilities, outperforming peers (e

Weaknesses

1. Tasks are almost exclusively tabular, excluding multimodal inputs (e.g., text-image combinations, code-diagram interactions) common in modern DS.
2. Generating large-scale executable trajectories (for training data) is costly, and rejection sampling strategies may introduce biases toward shorter trajectories.
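The length-bias concern can be illustrated with a toy sketch of rejection sampling over trajectories. Everything here is an assumption made for illustration: the trajectories are hand-written step outcomes rather than data from the paper, and `rejection_sample` and `all_steps_ok` are hypothetical names, not the authors' pipeline. The point is only that when a trajectory is kept solely if every step succeeds, longer trajectories have more chances to fail, so the accepted set can skew short.

```python
from statistics import mean


def rejection_sample(trajectories, is_verified):
    """Keep only the trajectories that pass the verification check."""
    return [t for t in trajectories if is_verified(t)]


def all_steps_ok(trajectory):
    """Hypothetical verifier: accept only if every step succeeded."""
    return all(trajectory)


# Hand-written toy trajectories (True = the step ran without error).
trajectories = [
    [True],
    [True, True],
    [True, True, True],
    [True, False, True, True],
    [True, True, True, True, False],
    [True, True, True, True, True, True],
]

accepted = rejection_sample(trajectories, all_steps_ok)

# Compare the average length before and after filtering.
mean_len_all = mean(len(t) for t in trajectories)
mean_len_accepted = mean(len(t) for t in accepted)
print(f"all: {mean_len_all}, accepted: {mean_len_accepted}")
```

On this toy input the accepted trajectories are shorter on average than the full pool. Whether and how strongly the same effect appears in the real training data depends on the actual failure rates, which this sketch does not model.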

Code & Models

Datasets

Snowflake/dare-bench
dataset · 72 downloads

Taxonomy

Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Machine Learning and Data Classification