Detecting Quality Problems in Data Models by Clustering Heterogeneous Data Values

Viola Wenz; Arno Kesper; Gabriele Taentzer

arXiv:2111.06661·cs.LG·November 15, 2021

Open Access

TL;DR

This paper presents a bottom-up clustering approach to identify data quality issues caused by heterogeneity in data values, aiding domain experts in understanding and improving data models.

Contribution

It introduces a novel method for detecting data model quality problems through clustering heterogeneous data values, supporting domain expert analysis.

Findings

01 Effective in revealing data heterogeneity in practice

02 Supports domain experts in identifying data quality issues

03 Validated on cultural heritage data

Abstract

Data is of high quality if it is fit for its intended use. The quality of data is influenced by the underlying data model and its quality. One major quality problem is data heterogeneity, which impairs quality aspects such as understandability and interoperability. This heterogeneity may be caused by quality problems in the data model. It can occur in particular when the information given is not structured enough and is captured only in data values, often because the underlying data model lacks structure or provides unsuitable structure. We propose a bottom-up approach to detecting quality problems in data models that manifest in heterogeneous data values. It supports an exploratory analysis of the existing data and can be configured by domain experts according to their domain knowledge. All values of a selected data field are clustered by syntactic similarity. Thereby an…
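To make the idea of clustering a field's values by syntactic similarity more concrete, here is a minimal Python sketch. It is not the authors' method: this excerpt does not state the similarity measure or the clustering algorithm, so the sketch substitutes a simple stand-in that abstracts each value into a coarse character-class pattern and groups values sharing the same pattern. The function names `syntactic_pattern` and `cluster_by_pattern` and the sample date strings are hypothetical.

```python
import re
from collections import defaultdict


def syntactic_pattern(value: str) -> str:
    """Abstract a value into a coarse pattern (assumed stand-in for syntactic similarity).

    Runs of letters become 'A', runs of digits become '9', and runs of
    whitespace become a single space. Punctuation is kept as-is.
    """
    pattern = re.sub(r"[^\W\d_]+", "A", value.strip())  # letter runs
    pattern = re.sub(r"\d+", "9", pattern)              # digit runs
    pattern = re.sub(r"\s+", " ", pattern)              # whitespace runs
    return pattern


def cluster_by_pattern(values):
    """Group all values of one data field by their abstracted pattern."""
    clusters = defaultdict(list)
    for value in values:
        clusters[syntactic_pattern(value)].append(value)
    return dict(clusters)


# Hypothetical values of a single "date" field.
sample_values = ["1850", "ca. 1850", "1850-1860", "1900", "19th century", "ca. 1700"]

# List the largest clusters first so the dominant formats stand out.
for pattern, members in sorted(
    cluster_by_pattern(sample_values).items(), key=lambda kv: -len(kv[1])
):
    print(f"{pattern!r}: {members}")
```

In a sketch like this, each cluster corresponds to one way of writing the value. Many distinct clusters for a single field would suggest that information such as uncertainty markers or ranges is being encoded in free text, which a domain expert could then inspect to decide whether the data model needs a dedicated structure.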

Taxonomy

Topics: Semantic Web and Ontologies · Data Quality and Management · Advanced Database Systems and Queries