TL;DR
This paper introduces an automated data wrangling system that uses large language models to generate executable code for data quality tasks such as error detection and correction, without requiring extensive training.
Contribution
The novel system leverages large language models to automate data cleaning tasks, combining pattern recognition and external knowledge for versatile data quality improvements.
Findings
Effective code generation for data cleaning tasks
Addresses both memory-dependent and memory-independent tasks
Reduces need for task-specific training
Abstract
Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, often cannot capture semantic context, and deep learning approaches are resource-intensive, requiring task- and dataset-specific training. To overcome these shortcomings, we present an automated system that utilizes large language models to generate executable code for tasks like missing value imputation, error detection, and error correction. Our system aims to identify inherent patterns in the data while leveraging external knowledge, effectively addressing both memory-dependent and memory-independent tasks.
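As a rough illustration of the kind of executable code such a system might generate for missing value imputation, the sketch below uses a toy table. It is not the paper's implementation: the table, the function names, and the STATE_TO_COUNTRY lookup are all hypothetical. It also assumes that "memory-independent" means a value can be inferred from patterns within the table itself, while "memory-dependent" means the value comes from knowledge outside the table (here a hard-coded dictionary stands in for what a language model would supply).

```python
from collections import Counter, defaultdict

# Hypothetical toy table; None marks a missing value.
rows = [
    {"city": "Boston", "state": "MA", "country": "USA"},
    {"city": "Boston", "state": "MA", "country": "USA"},
    {"city": "Boston", "state": None, "country": "USA"},
    {"city": "Austin", "state": "TX", "country": None},
    {"city": "Austin", "state": "TX", "country": "USA"},
]

# Hypothetical stand-in for external knowledge (assumed, not from the paper).
STATE_TO_COUNTRY = {"MA": "USA", "TX": "USA"}


def impute_from_pattern(rows, key, target):
    """Fill missing `target` values with the most common value
    observed for the same `key` elsewhere in the table."""
    seen = defaultdict(Counter)
    for row in rows:
        if row[target] is not None:
            seen[row[key]][row[target]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input table is not mutated
        candidates = seen[new_row[key]]
        if new_row[target] is None and candidates:
            new_row[target] = candidates.most_common(1)[0][0]
        filled.append(new_row)
    return filled


def impute_from_knowledge(rows, key, target, lookup):
    """Fill missing `target` values from an external lookup keyed on `key`."""
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[target] is None and new_row[key] in lookup:
            new_row[target] = lookup[new_row[key]]
        filled.append(new_row)
    return filled


# Pattern-based step first, then the knowledge-based step.
step1 = impute_from_pattern(rows, "city", "state")
step2 = impute_from_knowledge(step1, "state", "country", STATE_TO_COUNTRY)
```

Running the two steps in sequence fills the missing state from the other Boston rows and the missing country from the lookup, leaving the original rows unchanged.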