Wrangling Data Issues to be Wrangled: Literature Review, Taxonomy, and Industry Case Study
Qiaolin Qin, Heng Li, Ettore Merlo

TL;DR
This paper reviews existing data quality taxonomies, identifies their limitations, and proposes a new, comprehensive two-dimensional taxonomy to improve issue detection and resolution in data management.
Contribution
It introduces a novel two-dimensional taxonomy of data quality issues based on attribute and outcome dimensions, addressing overlaps and ambiguities in previous taxonomies.
Findings
Redefined categories improve clarity and mutual exclusivity.
Labeled issues reveal distribution patterns and effort estimates.
The taxonomy enhances understanding and handling of data quality problems.
Abstract
Data quality is vital for user experience in products reliant on data. As solutions for data quality problems, researchers have developed various taxonomies for different types of issues. However, although some of the existing taxonomies are near-comprehensive, their complexity has limited their actionability in developing solutions for data issues. Hence, recent researchers have proposed more concise sets of data issue categories for better usability. Although more concise, modern data issue labeling caters heavily to specific solution systems, which can leave the resulting taxonomies not mutually exclusive. Consequently, categories sometimes overlap when determining issue types, or the same category is defined differently across studies. This hinders solution development and confounds issue detection. Therefore, based on observations from a literature review and…
Taxonomy
Topics: Big Data and Business Intelligence
