Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models

Xinyuan Liu; Jiahui Chen; Bocheng Hu; Yu Sun; Xinyang Chen; Shaoxu Song; Yongxin Tong

arXiv:2507.10934 · cs.DB · March 10, 2026

Open Access

TL;DR

This paper presents TableEG, a framework using large language models to generate authentic, diverse errors in tabular data, enabling more realistic benchmarking of data cleaning techniques.

Contribution

The paper introduces TableEG, a novel LLM-based approach for generating realistic errors in tabular data, improving evaluation of error detection and correction methods.

Findings

01 Errors generated by TableEG closely mimic real-world error distributions.

02 Error detection methods perform on TableEG-generated errors much as they do on real-world errors.

03 TableEG outperforms rule-based and non-fine-tuned LLM methods in error realism.

Abstract

Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we introduce TableEG, a framework that leverages large language models (LLMs) to generate authentic errors. By employing a table fine-tuning strategy and a triplet representation $(I, T, O)$ to model error generation, detection, and correction tasks, TableEG captures the complex dependencies inherent in two-dimensional tables. Trained on 12 real-world datasets spanning 10 diverse domains, TableEG ensures that the…
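As a rough illustration of the triplet representation $(I, T, O)$ mentioned above, the sketch below builds one triplet for each of the three tasks from an aligned clean/dirty table pair. The excerpt does not define the three components, so reading $I$ as an instruction, $T$ as a serialized table, and $O$ as the expected output is an assumption, as are the names `Triplet`, `serialize`, and `make_triplets`, the prompt wording, and the JSON-lines serialization. It is not the authors' implementation.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    instruction: str  # I: assumed to be the task prompt
    table: str        # T: assumed to be the serialized input table
    output: str       # O: assumed to be the expected model response


def serialize(rows):
    """Serialize a list of dict rows as one JSON object per line (hypothetical format)."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


def make_triplets(clean_rows, dirty_rows):
    """Build one triplet per task from row-aligned clean and dirty tables."""
    # Collect every cell whose value differs between the two tables.
    errors = [
        {"row": i, "column": col, "clean": clean[col], "dirty": dirty[col]}
        for i, (clean, dirty) in enumerate(zip(clean_rows, dirty_rows))
        for col in clean
        if clean[col] != dirty[col]
    ]
    clean_text = serialize(clean_rows)
    dirty_text = serialize(dirty_rows)
    located = [{"row": e["row"], "column": e["column"]} for e in errors]
    fixed = [
        {"row": e["row"], "column": e["column"], "value": e["clean"]}
        for e in errors
    ]
    return {
        # Generation: clean table in, injected errors out.
        "generation": Triplet(
            "Inject realistic errors into the table.", clean_text, json.dumps(errors)
        ),
        # Detection: dirty table in, positions of erroneous cells out.
        "detection": Triplet(
            "List the erroneous cells in the table.", dirty_text, json.dumps(located)
        ),
        # Correction: dirty table in, repaired values out.
        "correction": Triplet(
            "Return the corrected value for each erroneous cell.",
            dirty_text,
            json.dumps(fixed),
        ),
    }
```

Sharing one structure across the three tasks means the same clean/dirty pair can supply training examples for each of them; only the instruction and the expected output change.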

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Digital and Cyber Forensics