A Catalog of Data Errors
Divya Bhadauria, Hazar Harmouch, Felix Naumann, Divesh Srivastava, Lisa Ehrlinger

TL;DR
This paper presents a comprehensive catalog of 35 data error types in tabular data, providing formal definitions and examples to aid detection and correction in data quality practices.
Contribution
It introduces a detailed taxonomy that covers both data errors and indicators, with formal definitions intended to improve error detection and data cleaning strategies.
Findings
Catalog includes 35 distinct error types and indicators.
Errors are classified into missing, incorrect, and redundant categories.
Provides formal definitions and practical examples for each error type.
Abstract
Data errors are widespread in real-world databases and severely impact downstream applications, such as machine learning pipelines or business analytics reports. Causes of such errors are manifold and can arise during both the design phase and the operational phase of a database. Some error types, such as missing values, duplicate tuples, or constraint violations, are widely recognized; others, such as disguised missing values or word transpositions, remain underexplored. Existing attempts to define and classify errors in data offer valuable but limited taxonomies, mostly informal and not covering the full range of error types. With the rise of AI, practitioners must increasingly detect and correct statistical errors such as bias and outliers, which are rarely considered within existing error taxonomies. This catalog presents a comprehensive list of 35 distinct error types, including…
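To make a few of the error types named in the abstract concrete, here is a minimal Python sketch that flags missing values, disguised missing values, and duplicate tuples in a small table. It is an illustration under stated assumptions, not the catalog's formal definitions: the function names, the sample rows, and the `DISGUISED_MISSING` sentinel set are all hypothetical, and a table is modeled simply as a list of dictionaries.

```python
# Hypothetical sentinel strings that often stand in for a missing value.
# This set is an assumption for illustration, not taken from the catalog.
DISGUISED_MISSING = {"n/a", "na", "unknown", "-", "?", "9999"}


def find_missing(rows):
    """Return (row_index, column) pairs whose value is None or blank."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col, val in row.items()
        if val is None or (isinstance(val, str) and val.strip() == "")
    ]


def find_disguised_missing(rows, sentinels=DISGUISED_MISSING):
    """Return (row_index, column) pairs whose value matches a sentinel string."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col, val in row.items()
        if isinstance(val, str) and val.strip().lower() in sentinels
    ]


def find_duplicate_tuples(rows):
    """Return indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))  # order-independent row signature
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


# Hypothetical sample table.
rows = [
    {"name": "Ada", "city": "London", "phone": "555-0100"},
    {"name": "Bob", "city": "", "phone": "N/A"},
    {"name": "Ada", "city": "London", "phone": "555-0100"},
    {"name": "Cy", "city": None, "phone": "unknown"},
]

print(find_missing(rows))
print(find_disguised_missing(rows))
print(find_duplicate_tuples(rows))
```

Exact matching of this kind only catches the simplest cases. Near-duplicates, domain-specific sentinels, and statistical issues such as outliers or bias would need additional checks that this sketch does not attempt.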
