Merging datasets through deep learning

Kavitha Srinivas; Abraham Gale; Julian Dolby

arXiv:1809.01604 · cs.LG · September 6, 2018 · 6 citations

TL;DR

This paper introduces a deep learning-based method for merging datasets by recognizing different surface forms of the same entity, improving over traditional string-based methods through vector space modeling and nearest neighbor search.

Contribution

The paper presents a novel deep learning approach with specialized metric learning techniques for entity surface form matching, enabling more accurate dataset merging.

Findings

01 Achieved precision@1 of 0.75 to 0.81 in entity matching

02 Achieved recall of 0.74 to 0.81 in entity matching

03 Models are publicly available for use in dataset alignment
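
For readers unfamiliar with the metrics in the findings above, the sketch below shows one common way to compute them. It assumes that precision@1 means the fraction of queries whose top-ranked candidate is a correct match, and that recall is measured over the top k candidates per query; the page does not state how the paper computed recall, so `recall_at_k` is an illustrative assumption. The function names and the input layout (dictionaries keyed by query string) are hypothetical.

```python
def precision_at_1(ranked, truth):
    """Fraction of queries whose best-ranked candidate is a correct match.

    ranked: dict mapping query -> list of candidates, best first
    truth:  dict mapping query -> set of correct matches
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for query, candidates in ranked.items()
        if candidates and candidates[0] in truth.get(query, set())
    )
    return hits / len(ranked)


def recall_at_k(ranked, truth, k):
    """Fraction of all correct matches found in the top k candidates.

    Assumed definition; the source does not say how recall was computed.
    """
    total = sum(len(matches) for matches in truth.values())
    if total == 0:
        return 0.0
    found = sum(
        len(set(ranked.get(query, [])[:k]) & matches)
        for query, matches in truth.items()
    )
    return found / total


# Toy data: the first query is answered correctly at rank 1,
# the second only at rank 2.
ranked = {
    "Adams, Douglas": ["Douglas Adams", "Ansel Adams"],
    "Lovelace, Ada": ["Ada Byron", "Ada Lovelace"],
}
truth = {
    "Adams, Douglas": {"Douglas Adams"},
    "Lovelace, Ada": {"Ada Lovelace"},
}
```

On this toy data, `precision_at_1(ranked, truth)` counts only the first query as a hit, while `recall_at_k(ranked, truth, 2)` also credits the second query because its correct match appears at rank 2.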

Abstract

Merging datasets is a key operation for data analytics. A frequent requirement for merging is joining across columns that have different surface forms for the same entity (e.g., the name of a person might be represented as "Douglas Adams" or "Adams, Douglas"). Similarly, ontology alignment can require recognizing distinct surface forms of the same entity, especially when ontologies are independently developed. However, data management systems are currently limited to performing merges based on string equality, or at best using string similarity. We propose an approach to performing merges based on deep learning models. Our approach depends on (a) creating a deep learning model that maps surface forms of an entity into a set of vectors such that alternate forms for the same entity are closest in vector space, (b) indexing these vectors using a nearest neighbors algorithm to find the…
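
The two steps named in the abstract, (a) mapping surface forms to vectors and (b) finding nearest neighbors among them, can be illustrated with a minimal sketch. This is not the authors' model: the learned deep network is replaced here by a hand-written character-trigram embedding over sorted tokens, and the nearest-neighbor index is a brute-force cosine scan. All function names (`embed`, `cosine`, `build_index`, `nearest`, `fuzzy_join`) are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def embed(name, n=3):
    """Stand-in for the learned model (an assumption, not the paper's network).

    Lowercases the name, sorts its tokens so word order does not matter,
    and returns a sparse vector of character n-gram counts.
    """
    tokens = sorted(re.findall(r"[a-z0-9]+", name.lower()))
    padded = " " + " ".join(tokens) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_index(names):
    """Step (b), simplified: precompute one vector per surface form."""
    return [(name, embed(name)) for name in names]


def nearest(index, query, k=1):
    """Brute-force scan returning the k most similar indexed names."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


def fuzzy_join(left, right):
    """Pair each name in `left` with its nearest neighbor in `right`."""
    index = build_index(right)
    return [(name, nearest(index, name)[0]) for name in left]


pairs = fuzzy_join(["Adams, Douglas"], ["Douglas Adams", "Ansel Adams"])
```

Here `pairs` links "Adams, Douglas" to "Douglas Adams", because both forms produce the same trigram vector once tokens are sorted. A learned embedding would be expected to handle variants that this hand-written stand-in cannot, such as nicknames or abbreviations, and a real system would likely swap the linear scan for an approximate nearest-neighbor index.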

Code & Models

Repositories

yehudagale/fuzzyjoiner · TensorFlow · Official

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques