DomainNet: Homograph Detection for Data Lake Disambiguation

Aristotelis Leventidis; Laura Di Rocco; Wolfgang Gatterbauer; Renée J. Miller; Mirek Riedewald

arXiv:2103.09940 · cs.DB · March 24, 2021 · 5 cites

Open Access

TL;DR

This paper introduces DomainNet, a network-based method for detecting homographs in data lakes. It reports higher homograph-detection precision than existing domain discovery techniques applied to the same task.

Contribution

DomainNet leverages network-centrality measures on a bipartite graph to identify homographs in data lakes without supervision, outperforming state-of-the-art domain discovery methods.

Findings

01. Achieves 69% precision on synthetic benchmarks

02. Reaches 89% precision on real data lakes

03. Outperforms existing domain discovery techniques in homograph detection

Abstract

Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery…
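
The bipartite value-attribute graph described in the abstract can be illustrated with a small sketch. The excerpt above does not say which centrality measure is used, so this example assumes betweenness centrality (computed with Brandes' algorithm) purely for illustration. The toy lake, the function names, and the lowercase normalization of values are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict, deque

def build_bipartite_graph(tables):
    """Build an undirected bipartite graph: attribute nodes <-> value nodes.

    `tables` maps an attribute (column) name to the values it contains.
    Values are lowercased so identical strings share one node (assumption).
    """
    adj = defaultdict(set)
    for attr, values in tables.items():
        attr_node = ("attr", attr)
        for value in values:
            value_node = ("val", value.strip().lower())
            adj[attr_node].add(value_node)
            adj[value_node].add(attr_node)
    return adj

def betweenness(adj):
    """Unnormalized betweenness centrality for an unweighted, undirected graph."""
    bc = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order.
        delta = dict.fromkeys(adj, 0.0)
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: score / 2 for v, score in bc.items()}

def rank_homograph_candidates(tables):
    """Return (value, score) pairs, highest centrality first."""
    scores = betweenness(build_bipartite_graph(tables))
    values = [(node[1], s) for node, s in scores.items() if node[0] == "val"]
    return sorted(values, key=lambda pair: -pair[1])

# Hypothetical toy lake: "Jaguar" appears as both an animal and a car make.
TOY_LAKE = {
    "zoo.animal": ["Jaguar", "Panda", "Lemur"],
    "cars.make": ["Jaguar", "Fiat", "Toyota"],
}

for value, score in rank_homograph_candidates(TOY_LAKE):
    print(f"{value}: {score}")
```

In this toy lake, "jaguar" is the only value that bridges the two attributes, so it ranks first while every other value scores zero. A high score only marks a value as a candidate homograph; a value that legitimately repeats across columns with the same meaning would also connect several attributes, so some threshold or further analysis would still be needed.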

Taxonomy

Topics: Data Quality and Management · Semantic Web and Ontologies · Advanced Graph Neural Networks