Em-K Indexing for Approximate Query Matching in Large-scale ER

Samudra Herath; Matthew Roughan; Gary Glonek

arXiv:2111.04070·cs.DB·November 9, 2021

Open Access

TL;DR

This paper introduces Em-K Indexing, a novel approximate indexing method for entity resolution that uses spatial embeddings and Kd-trees to enable fast query matching on large-scale datasets.

Contribution

The paper proposes a new approximate indexing technique combining spatial embeddings and Kd-trees for efficient large-scale entity resolution query matching.
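The combination described above can be illustrated with a minimal sketch. It is not the authors' implementation: `embed` is a hypothetical stand-in (hashed character-bigram counts) for the paper's spatial embedding, the Kd-tree is hand-rolled so the example stays self-contained, and all names (`DIM`, `embed`, `build_kdtree`, `knn`, `candidates`) and the sample records are assumptions made for illustration only.

```python
import heapq
import zlib
from math import sqrt

DIM = 8  # hypothetical embedding dimension, chosen only for this sketch


def embed(text, dim=DIM):
    """Toy embedding (assumption): hashed character-bigram counts, L2-normalised."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - 1):
        # crc32 is deterministic across runs, unlike the built-in hash()
        vec[zlib.crc32(padded[i:i + 2].encode()) % dim] += 1.0
    norm = sqrt(sum(v * v for v in vec)) or 1.0
    return tuple(v / norm for v in vec)


def build_kdtree(points, depth=0):
    """Build a Kd-tree from (vector, record_id) pairs by median splitting."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def knn(node, query, k, heap=None):
    """Collect the k nearest points as a heap of (-squared_distance, record_id)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    (vec, rid), axis, left, right = node
    d2 = sum((a - b) ** 2 for a, b in zip(vec, query))
    if len(heap) < k:
        heapq.heappush(heap, (-d2, rid))
    elif d2 < -heap[0][0]:
        heapq.heapreplace(heap, (-d2, rid))  # evict the current worst candidate
    diff = query[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    knn(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the worst hit.
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn(far, query, k, heap)
    return heap


def candidates(tree, query_text, k):
    """Return the record ids of the k nearest reference records, nearest first."""
    heap = knn(tree, embed(query_text), k)
    return [rid for _, rid in sorted(heap, reverse=True)]


# Hypothetical reference records; the record id is the list position.
records = ["john smith", "jon smyth", "maria garcia", "mary garcia",
           "wei zhang", "way chang", "ahmed khan", "samantha lee"]
points = [(embed(name), i) for i, name in enumerate(records)]
tree = build_kdtree(points)
```

Calling `candidates(tree, "jon smith", 3)` returns three record ids, which a full ER pipeline would then pass to a more expensive record comparison. The point of the design is that only those k candidates, rather than the whole reference database, need to be compared against the query.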

Findings

01. Reduces the search space, which makes the method effective on large datasets.

02. Matches queries by examining only a small fraction of the data.

03. Shows promising results on multiple datasets.

Abstract

Accurate and efficient entity resolution (ER) is a significant challenge in many data mining and analysis projects requiring integrating and processing massive data collections. It is becoming increasingly important in real-world applications to develop ER solutions that produce prompt responses for entity queries on large-scale databases. Some of these applications demand entity query matching against large-scale reference databases within a short time. We define this as the query matching problem in ER in this work. Indexing or blocking techniques reduce the search space and execution time in the ER process. However, approximate indexing techniques that scale to very large-scale datasets remain open to research. In this paper, we investigate the query matching problem in ER to propose an indexing method suitable for approximate and efficient query matching. We first use spatial…

Taxonomy

Topics: Data Quality and Management · Data Management and Algorithms · Privacy-Preserving Technologies in Data