DCA-Bench: A Benchmark for Dataset Curation Agents
Benhao Huang, Yingzhuo Yu, Jin Huang, Xingjian Zhang, Jiaqi Ma

TL;DR
This paper introduces DCA-Bench, a benchmark for evaluating dataset curation agents using real-world test cases and an automatic GPT-4o-based evaluation, highlighting the challenges faced by LLMs in detecting data quality issues.
Contribution
The work establishes a benchmark with curated test cases and an evaluation framework to assess LLM agents' ability to identify dataset issues in real-world scenarios.
Findings
Most LLM agents detect only about 30% of data issues without hints.
The benchmark reveals the complexity of dataset curation for LLMs.
GPT-4o evaluation aligns well with expert assessments.
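One way to make "aligns well with expert assessments" concrete is to compare the automatic evaluator's verdicts with human verdicts on the same cases. The sketch below is illustrative only: the function name `agreement_stats` and the toy labels are assumptions, not the paper's code or data. It computes raw agreement and Cohen's kappa, which corrects for chance agreement.

```python
def agreement_stats(human, auto):
    """Return (raw agreement, Cohen's kappa) for two lists of boolean verdicts."""
    if len(human) != len(auto) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of cases where both raters give the same verdict
    observed = sum(h == a for h, a in zip(human, auto)) / n
    # Agreement expected by chance, from each rater's rate of True verdicts
    p_h = sum(human) / n
    p_a = sum(auto) / n
    expected = p_h * p_a + (1 - p_h) * (1 - p_a)
    # If chance agreement is already 1, both raters gave one identical label
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical verdicts: True means "the agent found the issue"
human_labels = [True, True, False, False, True, False, False, True]
auto_labels = [True, True, False, True, True, False, False, False]
acc, kappa = agreement_stats(human_labels, auto_labels)
print(f"agreement={acc:.2f}, kappa={kappa:.2f}")
```

Raw agreement alone can look high when one verdict dominates, so reporting kappa alongside it is a common safeguard.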
Abstract
The quality of datasets plays an increasingly crucial role in the research and development of modern artificial intelligence (AI). Despite the proliferation of open dataset platforms nowadays, data quality issues, such as incomplete documentation, inaccurate labels, ethical concerns, and outdated information, remain common in widely used datasets. Furthermore, these issues are often subtle and difficult to detect with rule-based scripts, and therefore require identification and verification by dataset users or maintainers, a process that is both time-consuming and prone to human mistakes. With the surging ability of large language models (LLMs), it is promising to streamline the discovery of hidden dataset issues with LLM agents. To achieve this, one significant challenge is enabling LLM agents to detect issues in the wild rather than simply fixing known ones. In this work, we establish…
Peer Reviews
Decision: Submitted to ICLR 2025
S1: The exploration of data quality issues is an interesting and important problem. If LLMs can automate the detection of data quality issues, they will help with the automatic maintenance of data publishing platforms.
S2: The dataset collection process seems reasonable. Four different types of issues are annotated, representing the typical issues that may arise on public data platforms.
S3: The evaluation system leverages LLMs in multiple perspectives for implementing different evalua…
W1: The Evaluator is designed to leverage LLMs as judges for assessing whether the detection results are correct. However, as the authors mention, the annotated test data for the Evaluator were collected after the prompt design for the Evaluator. This raises the concern that the test data may simply align with the designed prompts, potentially indicating that the Evaluator’s performance is optimized for this specific data. This calls into question how the proposed prompt design of the…
- The work addresses a significant problem in AI, as dataset quality is critical for robust model performance and reliable research outcomes.
- The work provides a foundation for future research into fully autonomous dataset curation systems, which could save time and reduce errors in data management.
- The framework offers different hint levels to evaluate LLMs at multiple stages of problem discovery, providing nuanced insights into model capabilities.
- DCA-Bench constructs an automatic evalua…
- There is a lack of a fully realistic testing environment. While comprehensive, the test cases may not fully represent the diversity of data quality issues encountered in real-world curation.
- Performance is heavily dependent on hints. Without hints, the baseline LLM agent detected only 11% of issues, indicating a limited ability to autonomously identify dataset problems.
- The automatic evaluation pipeline, while effective, may introduce subtle biases that differ from nuanced human judgment.
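The dependence on hints noted above can be read off a per-hint-level success rate. The following is a minimal sketch under stated assumptions: the integer hint levels, the `(hint_level, detected)` record layout, and the toy outcomes are hypothetical and are not taken from DCA-Bench.

```python
from collections import defaultdict


def success_rate_by_hint(records):
    """Map each hint level to the fraction of test cases where the issue was detected."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for hint_level, detected in records:
        totals[hint_level] += 1
        hits[hint_level] += int(detected)
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Hypothetical outcomes: level 0 means no hint, higher levels give more specific hints
records = [
    (0, False), (0, False), (0, True), (0, False),
    (1, True), (1, False), (1, True), (1, False),
    (2, True), (2, True), (2, True), (2, False),
]
rates = success_rate_by_hint(records)
for level, rate in rates.items():
    print(f"hint level {level}: {rate:.0%} detected")
```

A large gap between the no-hint rate and the most specific level is what "heavily dependent on hints" looks like in such a table.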
1. Innovative Concept: The paper presents a novel approach by shifting focus from issue-solving to issue-discovery in dataset curation.
2. Comprehensive Coverage: DCA-Bench provides a broad testing ground with diverse curated test cases, representing real-world scenarios across different platforms.
3. Automatic Evaluation Framework: The use of GPT-4 for evaluation is a practical advancement, offering scalability where human annotation is not feasible.
4. Clear Practical Relevance: Tackling hidde…
1. Limited Test Set Size: With only 221 samples, and limited instances in some categories (e.g., only 10 ethical instances), the test set might not sufficiently capture the complexity of real-world data quality issues. Increasing the dataset size and diversity could improve robustness.
2. Missing Related Works: The paper does not adequately engage with existing literature regarding data system and LLM agents, such as [arXiv.2402.02643, LLM-Enhanced Data Management], [SIGMOD’24, Data-juicer: A on…
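To illustrate why a category with only 10 instances gives a noisy estimate, the sketch below computes a 95% Wilson score interval for a detection rate. The sample sizes 10 and 221 come from the review above; the success counts (3 and 66) are hypothetical and chosen only so that both rates sit near 30%.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: 3 of 10 in a small category, 66 of 221 overall
small_lo, small_hi = wilson_interval(3, 10)
full_lo, full_hi = wilson_interval(66, 221)
print(f"n=10:  [{small_lo:.2f}, {small_hi:.2f}]")
print(f"n=221: [{full_lo:.2f}, {full_hi:.2f}]")
```

With 10 instances the interval spans roughly 0.11 to 0.60, so per-category comparisons at that size say little about which agent is actually better.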
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices
