Attribute-Based Semantic Type Detection and Data Quality Assessment
Marcelo Valentim Silva, Hannes Herrmann, Valerie Maxville

TL;DR
This paper presents a novel semantic type detection method that leverages attribute labels and rule-based analysis to improve data quality assessment across diverse datasets, outperforming existing tools in accuracy.
Contribution
The paper introduces a semantic type classification system built on attribute labels, rule-based analysis, and dictionaries of formats and abbreviations, and uses the detected types to flag data quality issues.
Findings
Detected 81 missing values across 922 attributes, outperforming YData Profiling.
Demonstrated superior accuracy and applicability over Sherlock in classifying semantic types.
Validated effectiveness across fifty datasets from the UCI repository.
Abstract
The reliance on data-driven decision-making across sectors highlights the critical need for high-quality data; despite advancements, data quality issues persist, significantly impacting business strategies and scientific research. Current data quality methods fail to leverage the semantic richness embedded in words inside attribute labels (or column names/headers in tables) across diverse datasets and domains, leaving a crucial gap in comprehensive data quality evaluation. This research addresses this gap by introducing an innovative methodology centered around Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information within attribute labels, combined with rule-based analysis and comprehensive Formats and Abbreviations dictionaries, our approach introduces a practical semantic type classification system comprising approximately 23 types,…
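The general idea described in the abstract (read the words in an attribute label, expand abbreviations, map the label to a semantic type, then check values against that type's expected format) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the dictionaries `ABBREVIATIONS`, `TYPE_KEYWORDS`, and `FORMATS`, the type names, and the function names are hypothetical and are not taken from the paper, whose actual rules and type system are not shown in this summary.

```python
import re

# Hypothetical abbreviation dictionary (illustrative only, not the paper's).
ABBREVIATIONS = {
    "dob": "date of birth",
    "qty": "quantity",
    "temp": "temperature",
    "pct": "percent",
}

# Hypothetical keyword -> semantic type rules (illustrative only).
TYPE_KEYWORDS = {
    "date": "date",
    "birth": "date",
    "quantity": "count",
    "count": "count",
    "temperature": "measurement",
    "percent": "percentage",
    "age": "age",
}

# Hypothetical per-type format checks (illustrative only).
FORMATS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "count": re.compile(r"\d+"),
    "age": re.compile(r"\d{1,3}"),
    "percentage": re.compile(r"\d+(\.\d+)?%?"),
    "measurement": re.compile(r"-?\d+(\.\d+)?"),
}


def tokenize_label(label):
    """Split a column name such as 'patient_DOB' or 'orderQty' into lowercase words."""
    # Insert a space at lowercase/digit -> uppercase boundaries (camelCase).
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    # Split on anything that is not a letter or digit.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def detect_semantic_type(label):
    """Return a semantic type for the label, or None if no rule matches."""
    words = []
    for token in tokenize_label(label):
        # Expand known abbreviations into their full wording.
        words.extend(ABBREVIATIONS.get(token, token).split())
    for word in words:
        if word in TYPE_KEYWORDS:
            return TYPE_KEYWORDS[word]
    return None


def flag_suspect_values(label, values):
    """Return values that do not fit the format expected for the label's type."""
    pattern = FORMATS.get(detect_semantic_type(label))
    if pattern is None:
        return []  # Unknown type: nothing to check against.
    return [v for v in values if not pattern.fullmatch(str(v).strip())]


if __name__ == "__main__":
    print(detect_semantic_type("patient_DOB"))
    print(flag_suspect_values("patient_DOB", ["1990-01-31", "?", "N/A", "2001-12-05"]))
```

In this sketch, placeholder strings such as "?" or "N/A" in a date column are flagged because they fail the date format, which is one way a label-derived type can surface disguised missing values that a purely value-based profiler might treat as ordinary strings.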
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Semantic Web and Ontologies
