Data Wrangling Task Automation Using Code-Generating Language Models

Ashlesha Akella; Krishnasuri Narayanam

arXiv:2502.15732 · cs.LG · February 25, 2025

TL;DR

This paper introduces an automated data wrangling system that uses large language models to generate executable code for data quality tasks such as missing value imputation, error detection, and error correction, without task-specific training.

Contribution

The system uses large language models to automate data cleaning tasks, combining patterns found in the data with external knowledge to cover a range of data quality tasks.

Findings

01 Effective code generation for data cleaning tasks

02 Addresses both memory-dependent and memory-independent tasks

03 Reduces the need for task-specific training

Abstract

Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, often cannot capture semantic context, and deep learning approaches are resource-intensive, requiring task- and dataset-specific training. To overcome these shortcomings, we present an automated system that utilizes large language models to generate executable code for tasks like missing value imputation, error detection, and error correction. Our system aims to identify inherent patterns in the data while leveraging external knowledge, effectively addressing both memory-dependent and memory-independent tasks.
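To make the idea of generated wrangling code concrete, the following is a minimal sketch of the kind of small, executable functions such a system might emit. It is not the authors' implementation. The table, the column names, the helper names `impute_from_dependency` and `detect_errors`, and the `VALID_STATES` lookup are all hypothetical. The sketch reads "memory-independent" as tasks solvable from patterns inside the table and "memory-dependent" as tasks that need outside knowledge, which is an interpretation rather than something defined on this page.

```python
from collections import Counter, defaultdict

# Toy table; rows and column names are invented for illustration.
rows = [
    {"zip": "10001", "city": "New York", "state": "NY"},
    {"zip": "10001", "city": None, "state": "NY"},
    {"zip": "60601", "city": "Chicago", "state": "IL"},
    {"zip": "60601", "city": "Chicago", "state": "ILL"},
    {"zip": "60601", "city": None, "state": "IL"},
]


def impute_from_dependency(rows, key, target):
    """Fill missing `target` values with the most common value
    observed for the same `key` (a pattern found in the data itself)."""
    seen = defaultdict(Counter)
    for row in rows:
        if row[target] is not None:
            seen[row[key]][row[target]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input table is left unchanged
        candidates = seen.get(new_row[key])
        if new_row[target] is None and candidates:
            new_row[target] = candidates.most_common(1)[0][0]
        filled.append(new_row)
    return filled


# Hypothetical stand-in for external knowledge that a language model
# could supply, here hard-coded as a set of accepted state codes.
VALID_STATES = {"NY", "IL", "CA"}


def detect_errors(rows, column, valid_values):
    """Return the indices of rows whose `column` value is present
    but not in the set of accepted values."""
    return [
        i
        for i, row in enumerate(rows)
        if row[column] is not None and row[column] not in valid_values
    ]


cleaned = impute_from_dependency(rows, key="zip", target="city")
suspect_rows = detect_errors(cleaned, column="state", valid_values=VALID_STATES)
```

Here `cleaned` has both missing `city` cells filled from rows that share the same `zip`, and `suspect_rows` flags the row whose `state` is `"ILL"`. Generating functions like these, rather than asking a model to rewrite each cell, means the same code can be run over an entire table and inspected before use.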
