ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities
Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, and Yotam Perlitz

TL;DR
This paper revises the ELT-Bench benchmark by identifying and correcting quality issues, revealing that AI agents are more capable in ELT pipeline tasks than previously estimated.
Contribution
It introduces ELT-Bench-Verified, a refined benchmark with improved evaluation and ground truth, demonstrating more accurate assessment of AI capabilities in data engineering.
Findings
Re-evaluation shows significant performance improvement after benchmark correction.
Most failed transformation tasks were attributable to benchmark errors rather than to agent limitations.
Benchmark quality issues are systemic in data engineering evaluation.
Abstract
Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks…
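The abstract reports inter-annotator agreement as Fleiss' kappa, a statistic that compares the observed agreement among a fixed number of raters with the agreement expected by chance. The following is a minimal sketch of the standard formula, not the authors' code. The function name `fleiss_kappa`, the two label categories, and the toy counts are hypothetical and chosen only for illustration; the abstract does not state how many annotators or categories the paper used.

```python
import math
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one rated item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_i) / n_items
    p_chance = sum(p * p for p in p_j)

    if math.isclose(p_chance, 1.0):
        # All ratings fall in one category, so kappa is undefined.
        return math.nan
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # Hypothetical toy data: 4 failed tasks, 3 annotators,
    # columns = ("benchmark error", "agent error").
    toy_counts = [[3, 0], [0, 3], [2, 1], [3, 0]]
    print(f"Fleiss' kappa on toy data: {fleiss_kappa(toy_counts):.3f}")
```

Values near 1 indicate agreement well above chance and values near 0 indicate agreement no better than chance, which is why a reported kappa of 0.85 is read as strong support for the human validation step.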
