CleanAgent: Automating Data Standardization with LLM-based Agents

Danrui Qi; Zhengjie Miao; Jiannan Wang

arXiv:2403.08291 · cs.LG · June 3, 2025 · 2 cites

TL;DR

CleanAgent is a framework that automates data standardization using LLM-based agents and a Python library, simplifying the process for data scientists and reducing manual coding effort.

Contribution

It introduces a unified API library and an LLM-integrated framework that automates data standardization, reducing complexity and manual intervention.

Findings

01. Significantly reduces coding complexity for data standardization.

02. Enables hands-free, requirement-based automation with LLM agents.

03. Provides a user-friendly web application for practical use.

Abstract

Data standardization is a crucial part of the data science life cycle. While tools like Pandas offer robust functionalities, their complexity and the manual effort required for customizing code to diverse column types pose significant challenges. Although large language models (LLMs) like ChatGPT have shown promise in automating this process through natural language understanding and code generation, using them still demands expert-level programming knowledge and continuous interaction for prompt refinement. To solve these challenges, our key idea is a Python library with declarative, unified APIs for standardizing different column types, which simplifies the LLM's code generation to concise API calls. We first propose Dataprep.Clean, a component of the Dataprep Python Library, which significantly reduces the coding complexity by enabling the standardization of specific column types with a…
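The "declarative, unified API" idea in the abstract can be illustrated with a minimal standard-library sketch. It assumes one entry point that takes the data, a column name, and a column type, so that each standardization step is a single call. Every name here (`clean_column`, `STANDARDIZERS`, the `_std_*` helpers) is hypothetical and chosen for illustration; this is not the Dataprep.Clean API or its implementation.

```python
import re
from datetime import datetime

# Hypothetical sketch: none of these names come from Dataprep.Clean.

def _std_email(value):
    """Trim and lowercase an email; return None if it does not look valid."""
    candidate = value.strip().lower()
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", candidate):
        return candidate
    return None

def _std_date(value):
    """Try a few common layouts and emit ISO 8601 (YYYY-MM-DD), else None."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

# One registry maps a declared column type to its standardizer.
STANDARDIZERS = {"email": _std_email, "date": _std_date}

def clean_column(rows, column, column_type):
    """Return new rows with an added '<column>_clean' field."""
    standardize = STANDARDIZERS[column_type]
    return [{**row, f"{column}_clean": standardize(row[column])} for row in rows]

rows = [
    {"email": " Alice@Example.COM ", "joined": "03/08/2024"},
    {"email": "not-an-email", "joined": "8 March 2024"},
]
# Each standardization step is one declarative call with the same signature.
rows = clean_column(rows, "email", "email")
rows = clean_column(rows, "joined", "date")
```

Because every column type shares the same call shape, a code generator only has to pick a column and a type rather than write custom parsing logic for each case, which is the simplification the abstract attributes to concise API calls.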

Code & Models

Repositories

sfu-db/CleanAgent (Official)

Taxonomy

Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Data Quality and Management

Methods: Lib