Detecting Quality Problems in Data Models by Clustering Heterogeneous Data Values
Viola Wenz, Arno Kesper, Gabriele Taentzer

TL;DR
This paper presents a bottom-up clustering approach to identify data quality issues caused by heterogeneity in data values, aiding domain experts in understanding and improving data models.
Contribution
It introduces a novel method for detecting data model quality problems through clustering heterogeneous data values, supporting domain expert analysis.
Findings
Effective in revealing data heterogeneity in practice
Supports domain experts in identifying data quality issues
Validated on cultural heritage data
Abstract
Data is of high quality if it is fit for its intended use. The quality of data is influenced by the underlying data model and its quality. One major quality problem is the heterogeneity of data as quality aspects such as understandability and interoperability are impaired. This heterogeneity may be caused by quality problems in the data model. Data heterogeneity can occur in particular when the information given is not structured enough and just captured in data values, often due to missing or non-suitable structure in the underlying data model. We propose a bottom-up approach to detecting quality problems in data models that manifest in heterogeneous data values. It supports an explorative analysis of the existing data and can be configured by domain experts according to their domain knowledge. All values of a selected data field are clustered by syntactic similarity. Thereby an…
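The clustering step described in the abstract can be illustrated with a small sketch. The abstract does not say which similarity measure the authors use, so the code below swaps in a simple stand-in: each value is reduced to a coarse character-class pattern (runs of letters become "a", runs of digits become "9"), and values with identical patterns form one cluster. The function names and the sample date values are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def syntactic_pattern(value: str) -> str:
    """Reduce a value to a coarse character-class pattern.

    Assumption: this pattern abstraction is a stand-in for the paper's
    syntactic similarity measure, which the abstract does not specify.
    """
    pattern = re.sub(r"[^\W\d_]+", "a", value.strip())  # runs of letters -> "a"
    pattern = re.sub(r"\d+", "9", pattern)              # runs of digits  -> "9"
    pattern = re.sub(r"\s+", " ", pattern)              # normalize whitespace
    return pattern


def cluster_by_pattern(values):
    """Group values that share the same syntactic pattern."""
    clusters = defaultdict(list)
    for value in values:
        clusters[syntactic_pattern(value)].append(value)
    return dict(clusters)


# Hypothetical values of a single "date" field in a cultural heritage record set.
values = ["1890", "1750", "ca. 1890", "1890-1895", "19th century"]
clusters = cluster_by_pattern(values)

# List the largest clusters first so a reviewer sees the dominant formats.
for pattern, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(f"{pattern!r}: {members}")
```

In this sketch a field whose values fall into many distinct patterns is a candidate for closer inspection: uncertainty markers, ranges, and free-text periods all sharing one field may point to structure that the data model does not provide. A domain expert could adjust the abstraction rules, for example by keeping certain keywords literal, to reflect domain knowledge.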
Taxonomy
Topics: Semantic Web and Ontologies · Data Quality and Management · Advanced Database Systems and Queries
