DomainNet: Homograph Detection for Data Lake Disambiguation
Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

TL;DR
This paper introduces DomainNet, a network-based method for detecting homographs in data lakes, significantly improving disambiguation accuracy over existing domain discovery techniques.
Contribution
DomainNet leverages network-centrality measures on a bipartite graph to identify homographs in data lakes without supervision, outperforming state-of-the-art domain discovery methods.
Findings
Achieves 69% precision on synthetic benchmarks
Reaches 89% precision on real data lakes
Outperforms existing domain discovery techniques in homograph detection
Abstract
Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery…
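The bipartite value-attribute graph described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy tables, the function names (`build_bipartite_graph`, `betweenness`, `rank_values`), and the choice of betweenness centrality as the scoring measure are all assumptions made for illustration, since the text above only says "network-centrality measures". The sketch uses Brandes' algorithm for unweighted graphs and only the Python standard library.

```python
from collections import defaultdict, deque


def build_bipartite_graph(columns):
    """Build an undirected bipartite graph from {attribute: [values]}.

    Nodes are tagged tuples so a value and an attribute with the same
    string never collide. (Hypothetical helper, not from the paper.)
    """
    adj = defaultdict(set)
    for attr, values in columns.items():
        attr_node = ("attr", attr)
        for value in values:
            value_node = ("val", value)
            adj[attr_node].add(value_node)
            adj[value_node].add(attr_node)
    return dict(adj)


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes, unweighted, undirected)."""
    bc = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order.
        delta = dict.fromkeys(adj, 0.0)
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                bc[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: score / 2 for v, score in bc.items()}


def rank_values(columns):
    """Return (value, score) pairs for value nodes, highest score first."""
    scores = betweenness(build_bipartite_graph(columns))
    values = [(name, s) for (kind, name), s in scores.items() if kind == "val"]
    return sorted(values, key=lambda item: item[1], reverse=True)


# Toy data lake (invented for illustration): "Jaguar" appears both as an
# animal and as a car brand.
toy_lake = {
    "animal": ["Jaguar", "Puma", "Lion"],
    "zoo_species": ["Puma", "Lion", "Panda"],
    "car_brand": ["Jaguar", "Toyota", "Fiat"],
    "car_make": ["Toyota", "Fiat", "Honda"],
}

if __name__ == "__main__":
    for value, score in rank_values(toy_lake):
        print(f"{value}: {score:.1f}")
```

In this toy lake, "Jaguar" ranks first because it is the only value connecting the animal columns to the car columns, so every shortest path between the two groups passes through it. Whether a high score actually indicates a homograph in a real lake depends on the data and on the measure chosen.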
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies · Advanced Graph Neural Networks
