DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Fangyu Lei, Jinxiang Meng, Yiming Huang, Junjie Zhao, Yitong Zhang, Jianwen Luo, Xin Zou, Ruiyi Yang, Wenbo Shi, Yan Gao, Shizhu He, Zuo Wang, Qian Liu, Yang Wang, Ke Wang, Jun Zhao, Kang Liu

TL;DR
DAComp is a comprehensive benchmark of 210 tasks designed to evaluate data agents across the entire data intelligence lifecycle, exposing significant performance gaps in current AI systems for enterprise data workflows.
Contribution
This paper introduces DAComp, the first benchmark covering both data engineering and analysis tasks, with novel evaluation methods including multi-metric scoring and LLM-based judgment.
Findings
State-of-the-art agents achieve success rates below 20% on engineering tasks.
Performance on open-ended analysis tasks averages below 40%.
Current AI systems show critical limitations in holistic data pipeline orchestration.
Abstract
Real-world enterprise data intelligence workflows encompass data engineering, which turns raw sources into analysis-ready tables, and data analysis, which converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and modifying existing systems as requirements evolve. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge,…
Peer Reviews
Decision: ICLR 2026 Poster
- DAComp is the first benchmark to unify data engineering and analysis within a single evaluation framework.
- The benchmark construction pipeline is rigorous.
- The paper is dense but logically structured.
- While the LLM-judge method is well validated, DAComp’s rubric framework is static. Is adaptive rubric refinement required?
- The experiments show low success rates but do not deeply isolate why orchestration fails.
- The benchmark assumes one-shot or fixed-turn interactions, yet many enterprise agents operate iteratively. DAComp currently lacks tasks or metrics reflecting closed-loop self-correction, which might undervalue agents with strong iterative reasoning skills.
- CS/CFS/SR
1. **Holistic scope** – By covering both repository‑level engineering and open‑ended analysis, DAComp fills a clear gap in existing agent benchmarks that usually focus on isolated code generation or single‑turn QA.
2. **Realistic task design** – Tasks are built from 73 permissively‑licensed enterprise SaaS schemas, with synthetic data that respects realistic column distributions, referential integrity, and edge‑case noise. The DE‑Impl/Evol tasks involve multi‑file, multi‑layer pipelines (≈ 46
- While the authors report strong correlations with human ratings, the statistical choices and reporting could be more rigorous. Using Pearson r on rubric sums (often ordinal/heterogeneous across tasks) is suboptimal; intraclass correlation (ICC) or Kendall’s tau for ranking, plus confidence intervals, would be preferable.
- Weighted κ=65 for GSB indicates only moderate agreement; the paper calls overall alignment “high” but should temper claims or provide confidence intervals and calibration pl
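The reviewer's suggestion above (rank-based agreement plus confidence intervals) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the score lists are made-up toy values, and `kendall_tau_b` and `bootstrap_ci` are hypothetical helper names introduced here.

```python
import math
import random
from itertools import combinations


def kendall_tau_b(x, y):
    """Tie-adjusted Kendall's tau (tau-b) between two equal-length score lists."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")


def bootstrap_ci(x, y, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(x, y), resampling paired items."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = stat([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(value):  # skip degenerate resamples (all scores tied)
            estimates.append(value)
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


if __name__ == "__main__":
    # Toy values only (assumption): per-task rubric totals from a human and a judge.
    human = [3, 5, 2, 8, 7, 6]
    judge = [4, 5, 1, 9, 6, 7]
    tau = kendall_tau_b(human, judge)
    lower, upper = bootstrap_ci(human, judge, kendall_tau_b)
    print(f"tau-b = {tau:.3f}, 95% bootstrap CI = [{lower:.3f}, {upper:.3f}]")
```

With only a handful of items the interval will be wide, which is the point of reporting it: a single correlation coefficient hides how much the estimate depends on the sample.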
- Originality: Defines a new, holistic problem space combining engineering and analysis in realistic data workflows.
- Methodological rigor: Combines automatic and rubric-based evaluations, validated with human judgments.
- Scale and realism: Repository-level tasks (>4k LOC) and open-ended business questions reflect real enterprise workloads.
- Comprehensive evaluation: Benchmarks diverse models, offering insights into distinct skill domains (engineering vs reasoning).
- Actionable findings: Identifie
- Limited exploration of agent learning improvements: The paper benchmarks performance but does not propose or test training strategies to overcome observed limitations.
- Restricted open-source availability: While the paper claims data/code release, the double-blind setup prevents verification; explicit examples of released task formats would strengthen reproducibility.
- Evaluation cost: The LLM-as-judge pipeline, though validated, may pose high computational costs for community replication; appro
Taxonomy
Topics: Advanced Database Systems and Queries · Scientific Computing and Data Management · Data Quality and Management
