Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models
Xinyuan Liu, Jiahui Chen, Bocheng Hu, Yu Sun, Xinyang Chen, Shaoxu Song, Yongxin Tong

TL;DR
This paper presents TableEG, a framework using large language models to generate authentic, diverse errors in tabular data, enabling more realistic benchmarking of data cleaning techniques.
Contribution
The paper introduces TableEG, a novel LLM-based approach for generating realistic errors in tabular data, improving evaluation of error detection and correction methods.
Findings
Errors generated by TableEG closely mimic real-world error distributions.
Error detection methods perform similarly on TableEG-generated errors and on real-world errors.
TableEG outperforms rule-based and non-fine-tuned LLM methods in error realism.
Abstract
Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we introduce TableEG, a framework that leverages large language models (LLMs) to generate authentic errors. By employing a table fine-tuning strategy and a triplet representation to model error generation, detection, and correction tasks, TableEG captures the complex dependencies inherent in two-dimensional tables. Trained on 12 real-world datasets spanning 10 diverse domains, TableEG ensures that the…
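The abstract mentions a triplet representation covering error generation, detection, and correction, but does not spell out its form. The sketch below is one plausible reading, not the authors' implementation: it assumes each training example is an (instruction, table, output) triplet and derives all three tasks from a single clean/dirty table pair. The function name `build_triplets`, the instruction strings, and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def build_triplets(clean_rows: List[Row], dirty_rows: List[Row]) -> List[Tuple[str, List[Row], list]]:
    """Derive three (instruction, table, output) triplets from one clean/dirty pair.

    Assumption: both tables have the same rows in the same order and the
    same columns, so cells can be compared position by position.
    """
    # Collect every cell whose dirty value differs from the clean value.
    errors = []
    for row_idx, (c_row, d_row) in enumerate(zip(clean_rows, dirty_rows)):
        for col, c_val in c_row.items():
            d_val = d_row[col]
            if c_val != d_val:
                errors.append((row_idx, col, c_val, d_val))

    return [
        # Generation: clean table in, list of injected errors out.
        ("generate_errors", clean_rows, errors),
        # Detection: dirty table in, positions of erroneous cells out.
        ("detect_errors", dirty_rows, [(r, c) for r, c, _, _ in errors]),
        # Correction: dirty table in, positions plus repaired values out.
        ("correct_errors", dirty_rows, [(r, c, clean) for r, c, clean, _ in errors]),
    ]


# Hypothetical two-row table with a single typo in the "city" column.
clean = [{"city": "Berlin", "zip": "10115"}, {"city": "Paris", "zip": "75001"}]
dirty = [{"city": "Berlln", "zip": "10115"}, {"city": "Paris", "zip": "75001"}]

for instruction, table, output in build_triplets(clean, dirty):
    print(instruction, output)
```

Sharing one annotated table pair across the three tasks is a design choice of this sketch; it illustrates how generation, detection, and correction can be framed as views of the same data.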
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Digital and Cyber Forensics
