KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Eugenie Lai; Gerardo Vitagliano; Ziyu Zhang; Om Chabra; Sivaprasad Sudhir; Anna Zeng; Anton A. Zabreyko; Chenning Li; Ferdi Kossmann; Jialin Ding; Jun Chen; Markos Markakis; Matthew Russo; Weiyang Wang; Ziniu Wu; Michael J. Cafarella; Lei Cao; Samuel Madden; Tim Kraska

arXiv:2506.06541 · cs.DB · March 9, 2026

TL;DR

KramaBench is a comprehensive benchmark designed to evaluate AI systems' ability to automate complex data-to-insight pipelines over data lakes, revealing current limitations in end-to-end pipeline generation.

Contribution

This paper introduces KramaBench, a new benchmark with curated challenges and an evaluation framework for assessing AI systems' capabilities in data pipeline automation.

Findings

01

Current AI systems achieve only 55% end-to-end accuracy.

02

Even with perfect data retrieval, accuracy peaks at 62%.

03

Leading LLMs identify 42% of data tasks but fully implement only 20%.
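
The findings above report three kinds of rates: end-to-end accuracy, the share of data tasks a system identifies, and the share it fully implements. As a rough illustration of how such rates could be aggregated from per-challenge records, here is a minimal sketch. The `ChallengeResult` record, its field names, and the micro-averaging over tasks are all assumptions made for this example; the paper's evaluation framework may define and aggregate its metrics differently, and the toy numbers are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ChallengeResult:
    """Hypothetical per-challenge record; field names are illustrative only."""
    answer_correct: bool    # final answer matches the reference answer
    tasks_required: int     # data tasks in the reference pipeline
    tasks_identified: int   # required tasks the system planned for
    tasks_implemented: int  # required tasks the system implemented correctly


def summarize(results):
    """Aggregate toy metrics: per-challenge accuracy, per-task rates."""
    n_challenges = len(results)
    n_tasks = sum(r.tasks_required for r in results)
    return {
        "end_to_end_accuracy": sum(r.answer_correct for r in results) / n_challenges,
        "task_identification_rate": sum(r.tasks_identified for r in results) / n_tasks,
        "task_implementation_rate": sum(r.tasks_implemented for r in results) / n_tasks,
    }


# Toy data, invented for illustration.
toy = [
    ChallengeResult(answer_correct=True, tasks_required=5, tasks_identified=3, tasks_implemented=2),
    ChallengeResult(answer_correct=False, tasks_required=5, tasks_identified=1, tasks_implemented=0),
]
summary = summarize(toy)
print(summary)
```

Separating the three rates makes it visible when a system plans the right steps but fails to implement them, which is the gap Finding 03 describes.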

Abstract

Discovering insights from a real-world data lake potentially containing unclean, semi-structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project-specific insight. While AI models have shown remarkable results in reasoning and code generation, their ability to design and execute complex pipelines that solve these data-lake-to-insight challenges remains unclear. We introduce KramaBench, which consists of 104 manually curated and solved challenges spanning 1700 files, 24 data sources, and 6 domains. KramaBench focuses on testing the end-to-end capabilities of AI systems to solve challenges that require automated orchestration of different data tasks. KramaBench also features a comprehensive evaluation framework assessing the pipeline…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

See Summary

Weaknesses

See Summary

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- **Comprehensive Evaluation**: The multi-level evaluation framework (end-to-end, pipeline design, sub-task implementation) provides valuable insights into where systems fail.
- **Rigorous Curation Process**: The 4-step validation process involving multiple contributors, cross-validation, and manual verification of reference solutions demonstrates strong quality control. Grounding tasks in published studies ensures real-world relevance and avoids artificial task design.
- **Diverse Tasks**: The

Weaknesses

- **Unclear Motivation and Weak Problem Positioning**: The paper lacks a compelling motivation section explaining why this benchmark is needed now and what specific real-world problems it addresses that existing benchmarks cannot. The introduction jumps directly into the solution without establishing the problem's urgency or providing concrete use cases where current benchmarks fail.
- **Poor Figure Quality**: Figure 1 is a low-resolution raster image with blurry, barely readable text, which is

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Fills a critical gap in existing benchmarks (e.g., DS-1000, ARCADE) by focusing on end-to-end data lake processing rather than isolated tasks (code generation, text-to-SQL). Unlike prior work, it emphasizes real-world complexity (noisy data, multi-file integration, domain-specific knowledge) and requires systems to orchestrate all pipeline stages (discovery, cleaning, analysis).
2. Rigorous task curation via a 4-step validation process (curation → cross-contributor verification → key function

Weaknesses

1. The paper focuses on single-agent systems (DS-Guru, smolagents DR) but barely explores multi-agent architectures, which are increasingly proposed for complex data tasks (e.g., dividing pipeline stages across specialized agents). This is a missed opportunity, as multi-agent systems may mitigate single-agent limitations (e.g., context window constraints, heterogeneous skill requirements).
2. The results show large performance gaps across domains (e.g., smolagents DR achieves 60% accuracy in Env

Code & Models

Repositories

mitdbg/kramabench
Official

Taxonomy

Topics: Data Quality and Management · Scientific Computing and Data Management · Research Data Management Practices