AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

Lan Li; Liri Fang; Bertram Ludäscher; Vetle I. Torvik

arXiv:2412.06724 · cs.DB · August 26, 2025 · 2 cites

Open Access

TL;DR

AutoDCWorkflow uses large language models to generate data-cleaning workflows tailored to a stated analysis purpose. On the paper's benchmark, the generated workflows improve data quality and closely resemble human-curated ones.

Contribution

The paper introduces AutoDCWorkflow, a novel LLM-based pipeline for automatic data cleaning workflow generation and a benchmark for evaluating such workflows.

Findings

01. LLMs like Gemma 2-27B produce high-quality cleaned tables.

02. Gemma 2-9B generates workflows closely resembling human annotations.

03. AutoDCWorkflow outperforms baseline methods across multiple metrics.

Abstract

Data cleaning is a time-consuming and error-prone manual process, even with modern workflow tools such as OpenRefine. We present AutoDCWorkflow, an LLM-based pipeline for automatically generating data-cleaning workflows. The pipeline takes a raw table and a data analysis purpose, and generates a sequence of OpenRefine operations designed to produce a minimal, clean table sufficient to address the purpose. Six operations correspond to common data quality issues, including format inconsistencies, type errors, and duplicates. To evaluate AutoDCWorkflow, we create a benchmark with metrics assessing answers, data, and workflow quality for 142 purposes using 96 tables across six topics. The evaluation covers three key dimensions: (1) Purpose Answer: can the cleaned table produce a correct answer? (2) Column (Value): how closely does it match the ground truth table? (3) Workflow…
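The idea of a workflow as an ordered sequence of cleaning operations, scored against a ground-truth table, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the operation names (`trim`, `upper`, `drop_duplicates`), the dictionary-based step format, and the `column_match_ratio` score are hypothetical stand-ins, not the paper's six operations, not OpenRefine's operation format, and not the benchmark's actual metric definitions.

```python
from typing import Callable, Dict, List, Optional

# A table is a list of rows; each row maps column name -> cell value.
Table = List[Dict[str, str]]


def trim(table: Table, column: Optional[str]) -> Table:
    """Strip leading and trailing whitespace in one column."""
    return [{**row, column: row[column].strip()} for row in table]


def upper(table: Table, column: Optional[str]) -> Table:
    """Uppercase every value in one column (fixes case inconsistencies)."""
    return [{**row, column: row[column].upper()} for row in table]


def drop_duplicates(table: Table, column: Optional[str] = None) -> Table:
    """Keep the first occurrence of each fully identical row."""
    seen = set()
    result = []
    for row in table:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Hypothetical operation registry; the names are illustrative only.
OPERATIONS: Dict[str, Callable[[Table, Optional[str]], Table]] = {
    "trim": trim,
    "upper": upper,
    "drop_duplicates": drop_duplicates,
}


def apply_workflow(table: Table, workflow: List[Dict[str, str]]) -> Table:
    """Apply each step of the workflow to the table, in order."""
    for step in workflow:
        operation = OPERATIONS[step["op"]]
        table = operation(table, step.get("column"))
    return table


def column_match_ratio(cleaned: Table, truth: Table, column: str) -> float:
    """Fraction of cells in `column` equal to the ground truth.

    Simplification: tables with different row counts score 0.0.
    """
    if len(cleaned) != len(truth):
        return 0.0
    if not truth:
        return 1.0
    matches = sum(c[column] == t[column] for c, t in zip(cleaned, truth))
    return matches / len(truth)


raw_table = [
    {"city": " chicago ", "year": "2019"},
    {"city": "Chicago", "year": "2019"},
    {"city": "urbana", "year": "2020"},
]

# In the paper's setting an LLM would propose this sequence from the raw
# table and the analysis purpose; here it is written by hand.
workflow = [
    {"op": "trim", "column": "city"},
    {"op": "upper", "column": "city"},
    {"op": "drop_duplicates"},
]

ground_truth = [
    {"city": "CHICAGO", "year": "2019"},
    {"city": "URBANA", "year": "2020"},
]

cleaned_table = apply_workflow(raw_table, workflow)
score = column_match_ratio(cleaned_table, ground_truth, "city")
```

Order matters in this sketch: the duplicate row only becomes detectable after `trim` and `upper` have normalized the `city` values, which is why a workflow is modeled as a sequence rather than a set of operations.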

Taxonomy

Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems

Methods: Sparse Evolutionary Training