On the Complexity of Sorted Neighborhood
Mayank Kejriwal, Daniel P. Miranker

TL;DR
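This paper analyzes the computational complexity of the Sorted Neighborhood blocking method for record linkage. It shows that achieving maximum performance entails solving a sub-problem that is NP-complete, by reduction from the Travelling Salesman Problem (TSP), and it proposes three approximation algorithms based on approximate TSP solutions.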
Contribution
It establishes that achieving maximum performance with the Sorted Neighborhood method requires solving an NP-complete sub-problem, and it defines and analyzes three approximation algorithms that draw on approximate Travelling Salesman solutions.
Findings
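Achieving maximum performance with Sorted Neighborhood entails solving a sub-problem that is NP-complete (shown by reduction from the Travelling Salesman Problem).
The same NP-complete sub-problem can also occur in the traditional blocking method.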
Three approximation algorithms are proposed and analyzed.
Abstract
Record linkage concerns identifying semantically equivalent records in databases. Blocking methods are employed to avoid the cost of full pairwise similarity comparisons on records. In a seminal work, Hernandez and Stolfo proposed the Sorted Neighborhood blocking method. Several empirical variants have been proposed in recent years. In this paper, we investigate the complexity of the Sorted Neighborhood procedure on which the variants are built. We show that achieving maximum performance on the Sorted Neighborhood procedure entails solving a sub-problem, which is shown to be NP-complete by reducing from the Travelling Salesman Problem. We also show that the sub-problem can occur in the traditional blocking method. Finally, we draw on recent developments concerning approximate Travelling Salesman solutions to define and analyze three approximation algorithms.
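For readers unfamiliar with the procedure the abstract refers to, the following is a minimal sketch of Sorted Neighborhood as it is commonly described: sort the records by a blocking key, slide a fixed-size window over the sorted list, and compare only records that fall within the same window. The function name `sorted_neighborhood_pairs`, the `key` function, and the `window` parameter are illustrative assumptions, not notation taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def sorted_neighborhood_pairs(
    records: Sequence[str],
    key: Callable[[str], str],
    window: int,
) -> List[Tuple[str, str]]:
    """Return candidate pairs from a sliding window over key-sorted records.

    Illustrative sketch only: `key` and `window` are hypothetical parameters
    chosen for this example, not definitions from the paper.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to produce any pairs")

    # Step 1: order the records by their blocking key.
    ordered = sorted(records, key=key)

    # Step 2: pair each record with the next (window - 1) records in the order.
    pairs = []
    for i, record in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((record, ordered[j]))
    return pairs


records = ["smith john", "smyth john", "jones mary", "smith jon", "jonas mary"]
candidates = sorted_neighborhood_pairs(records, key=lambda r: r, window=3)
for left, right in candidates:
    print(left, "<->", right)
```

With n records and window size w, this emits on the order of n·w candidate pairs rather than the n(n-1)/2 pairs of a full pairwise comparison. Which true matches end up inside a window depends entirely on the ordering produced by the sort, so the choice of ordering determines how well the method performs.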
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Cryptography and Data Security
