Merging datasets through deep learning
Kavitha Srinivas, Abraham Gale, Julian Dolby

TL;DR
This paper introduces a deep learning-based method for merging datasets by recognizing different surface forms of the same entity, improving over traditional string-based methods through vector space modeling and nearest neighbor search.
Contribution
The paper presents a novel deep learning approach with specialized metric learning techniques for entity surface form matching, enabling more accurate dataset merging.
Findings
Achieved precision@1 of 0.75-0.81 in entity matching
Achieved recall of 0.74-0.81 in entity matching
Models are publicly available for use in dataset alignment
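As a point of reference for the figures above, here is a minimal sketch of one common definition of precision@1: the fraction of queries whose top-ranked candidate is the correct entity. The paper's exact evaluation protocol is not given on this page, so the function name `precision_at_1` and the toy data are illustrative assumptions only.

```python
def precision_at_1(top_predictions, gold):
    """Fraction of queries whose top-ranked candidate equals the gold entity.

    Assumed definition for illustration; not taken from the paper.
    """
    if not gold:
        return 0.0
    hits = sum(1 for query, truth in gold.items()
               if top_predictions.get(query) == truth)
    return hits / len(gold)


# Toy data (hypothetical): three of the four top-ranked matches are correct.
gold = {"q1": "e1", "q2": "e2", "q3": "e3", "q4": "e4"}
top_predictions = {"q1": "e1", "q2": "e2", "q3": "e3", "q4": "e9"}
score = precision_at_1(top_predictions, gold)
```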
Abstract
Merging datasets is a key operation for data analytics. A frequent requirement for merging is joining across columns that have different surface forms for the same entity (e.g., the name of a person might be represented as "Douglas Adams" or "Adams, Douglas"). Similarly, ontology alignment can require recognizing distinct surface forms of the same entity, especially when ontologies are independently developed. However, data management systems are currently limited to performing merges based on string equality, or at best using string similarity. We propose an approach to performing merges based on deep learning models. Our approach depends on (a) creating a deep learning model that maps surface forms of an entity into a set of vectors such that alternate forms for the same entity are closest in vector space, (b) indexing these vectors using a nearest neighbors algorithm to find the…
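The two steps named in the abstract, (a) mapping surface forms to vectors and (b) nearest-neighbor lookup over those vectors, can be sketched roughly as follows. This is not the authors' model: the `embed` function below is a hypothetical stand-in that uses character trigram counts instead of a learned network, and the brute-force `nearest` search stands in for whatever nearest neighbors index the paper uses. All names and sample data are assumptions for illustration.

```python
import math
from collections import Counter


def embed(name, n=3):
    # Stand-in for a learned embedding model (assumption): a sparse
    # bag of character n-grams over the lowercased, space-padded string.
    text = f" {name.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(query, index):
    # Brute-force nearest neighbor; a real system would use an
    # approximate nearest neighbors index for large columns.
    q = embed(query)
    return max(index, key=lambda name: cosine(q, index[name]))


# Hypothetical join: match names in one column to another column
# that writes the same entities in "Last, First" form.
right_column = ["Adams, Douglas", "Pratchett, Terry", "Le Guin, Ursula K."]
index = {name: embed(name) for name in right_column}
matches = {q: nearest(q, index) for q in ["Douglas Adams", "Terry Pratchett"]}
```

Here `matches` pairs each query with its closest entry in the other column, which is the join key a merge would then use. A learned model would replace `embed` so that forms with little character overlap (nicknames, abbreviations) also land close together.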
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques
