AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark
Lan Li, Liri Fang, Bertram Ludäscher, Vetle I. Torvik

TL;DR
AutoDCWorkflow leverages large language models to automatically generate data-cleaning workflows tailored to specific purposes, significantly improving data quality and matching human-curated processes across diverse datasets.
Contribution
The paper introduces AutoDCWorkflow, a novel LLM-based pipeline for automatic data cleaning workflow generation and a benchmark for evaluating such workflows.
Findings
LLMs like Gemma 2-27B produce high-quality cleaned tables.
Gemma 2-9B generates workflows closely resembling human annotations.
AutoDCWorkflow outperforms baseline methods across multiple metrics.
Abstract
Data cleaning is a time-consuming and error-prone manual process, even with modern workflow tools such as OpenRefine. We present AutoDCWorkflow, an LLM-based pipeline for automatically generating data-cleaning workflows. The pipeline takes a raw table and a data analysis purpose, and generates a sequence of OpenRefine operations designed to produce a minimal, clean table sufficient to address the purpose. Six operations correspond to common data quality issues, including format inconsistencies, type errors, and duplicates. To evaluate AutoDCWorkflow, we create a benchmark with metrics assessing answers, data, and workflow quality for 142 purposes using 96 tables across six topics. The evaluation covers three key dimensions: (1) Purpose Answer: can the cleaned table produce a correct answer? (2) Column (Value): how closely does it match the ground truth table? (3) Workflow…
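As a rough illustration of the idea in the abstract, the sketch below treats a workflow as an ordered list of column-level operations, applies it to a small raw table, and compares one column against a ground-truth table. It is a toy model in plain Python, not the authors' pipeline: the operation names `trim` and `upper`, the helpers `apply_workflow` and `column_match_ratio`, and the sample rows are all hypothetical, and the match ratio is only a simplified stand-in for the benchmark's column-value dimension.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Table = List[Row]


# Hypothetical operations for illustration only; they are not the paper's
# six OpenRefine operations and do not call any OpenRefine API.
def trim(value: str) -> str:
    return value.strip()


def upper(value: str) -> str:
    return value.upper()


OPERATIONS: Dict[str, Callable[[str], str]] = {"trim": trim, "upper": upper}


def apply_workflow(table: Table, workflow: List[Dict[str, str]]) -> Table:
    """Apply each step (an operation name plus a target column) in order."""
    cleaned = [dict(row) for row in table]  # copy so the raw table is untouched
    for step in workflow:
        op = OPERATIONS[step["op"]]
        for row in cleaned:
            row[step["column"]] = op(row[step["column"]])
    return cleaned


def column_match_ratio(cleaned: Table, truth: Table, column: str) -> float:
    """Fraction of rows whose cleaned value equals the ground-truth value."""
    matches = sum(c[column] == t[column] for c, t in zip(cleaned, truth))
    return matches / len(truth)


# Made-up sample data: inconsistent whitespace and casing in one column.
raw = [{"city": " chicago "}, {"city": "Chicago"}, {"city": "urbana"}]
truth = [{"city": "CHICAGO"}, {"city": "CHICAGO"}, {"city": "URBANA"}]
workflow = [
    {"op": "trim", "column": "city"},
    {"op": "upper", "column": "city"},
]

cleaned = apply_workflow(raw, workflow)
print(column_match_ratio(cleaned, truth, "city"))  # prints 1.0
```

In this toy setting the order of steps does not matter, but keeping the workflow as an explicit ordered list mirrors the abstract's description of a sequence of operations and makes it possible to compare a generated workflow with a human-written one step by step.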
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems
Methods: Sparse Evolutionary Training
