Data Cleaning Using Large Language Models

Shuo Zhang; Zezhou Huang; Eugene Wu

arXiv:2410.15547·cs.DB·October 22, 2024

Open Access

TL;DR

This paper presents Cocoon, a data cleaning system that combines large language models' semantic understanding with statistical error detection, decomposing complex tasks into workflows to improve accuracy over existing methods.

Contribution

Introduction of Cocoon, a novel data cleaning system that leverages large language models and task decomposition to enhance cleaning accuracy and efficiency.

Findings

01. Cocoon outperforms state-of-the-art data cleaning systems on standard benchmarks.

02. Combining LLM-based semantic understanding with statistical error detection improves cleaning accuracy.

03. Decomposing complex cleaning tasks into a workflow of smaller components makes them manageable for current LLMs.

Abstract

Data cleaning is a crucial yet challenging task in data analysis, often requiring significant manual effort. To automate data cleaning, previous systems have relied on statistical rules derived from erroneous data, resulting in low accuracy and recall. This work introduces Cocoon, a novel data cleaning system that uses large language models to derive rules based on semantic understanding and combines them with statistical error detection. Data cleaning is still too complex a task for current LLMs to handle in one shot, so Cocoon decomposes complex cleaning tasks into manageable components in a workflow that mimics human cleaning processes. Our experiments show that Cocoon outperforms state-of-the-art data cleaning systems on standard benchmarks.
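The idea of pairing statistical detection with semantic rules in a stepwise workflow can be illustrated with a minimal sketch. This is not Cocoon's implementation: the function names, the `max_share` threshold, and the lookup table standing in for an LLM call are all assumptions made for illustration. Step 1 flags rare values statistically, step 2 asks a semantic rule for a canonical form, and step 3 applies only the fixes the rule is able to propose.

```python
from collections import Counter
from typing import Callable, List, Optional, Set

# Hypothetical stand-in for an LLM call: a real system would prompt a model
# to map a raw value to its canonical form. A fixed lookup table is used here
# so the sketch stays runnable offline.
def stub_semantic_rule(value: str) -> Optional[str]:
    canonical = {
        "ny": "New York",
        "n.y.": "New York",
        "new york": "New York",
        "los angeles": "Los Angeles",
    }
    return canonical.get(value.strip().lower())

# Step 1 (statistical detection): flag values whose share of the column
# is at or below max_share. The threshold is an assumed parameter.
def detect_rare_values(column: List[str], max_share: float = 0.2) -> Set[str]:
    counts = Counter(column)
    total = len(column)
    return {v for v, c in counts.items() if c / total <= max_share}

# Steps 2 and 3 (semantic repair and application): for each flagged value,
# ask the semantic rule for a fix and keep the original if none is proposed.
def clean_column(
    column: List[str],
    semantic_rule: Callable[[str], Optional[str]],
    max_share: float = 0.2,
) -> List[str]:
    suspects = detect_rare_values(column, max_share)
    cleaned = []
    for value in column:
        if value in suspects:
            fix = semantic_rule(value)
            cleaned.append(fix if fix is not None else value)
        else:
            cleaned.append(value)
    return cleaned

if __name__ == "__main__":
    cities = ["New York"] * 6 + ["NY", "n.y.", "Los Angeles", "Los Angeles", "Bostn"]
    print(clean_column(cities, stub_semantic_rule))
```

In this sketch, "Bostn" is flagged as rare but left unchanged because the stub rule has no mapping for it; a statistical detector alone cannot say what the correct value is, which is the gap the semantic step is meant to fill.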

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Data Quality and Management