Maximum Entropy classification for record linkage
Danhyang Lee, Li-Chun Zhang, Jae-Kwang Kim

TL;DR
This paper adapts maximum entropy classification to record linkage. Links are chosen according to their associated uncertainty, and the resulting algorithm is scalable and fully automatic, whereas the classical approach generally requires clerical review to resolve undecided cases.
Contribution
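It adapts the maximum entropy classification method from text mining to record linkage, overcoming some persistent theoretical flaws of the classical Fellegi and Sunter (1969) approach and enabling scalable, fully automatic linking in both supervised and unsupervised settings.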
Findings
Overcomes some persistent theoretical flaws of the classical approach of Fellegi and Sunter (1969)
Provides a scalable, fully automatic algorithm for record linkage, with no clerical review step
Chooses the set of links according to the associated uncertainty
Abstract
By record linkage one joins records residing in separate files which are believed to be related to the same entity. In this paper we approach record linkage as a classification problem, and adapt the maximum entropy classification method in text mining to record linkage, both in the supervised and unsupervised settings of machine learning. The set of links will be chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is scalable and fully automatic, unlike the classical approach that generally requires clerical review to resolve the undecided cases.
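To make the classification framing concrete, the following is a minimal sketch of the supervised setting, not the authors' algorithm. It assumes that each candidate record pair is summarised by a binary agreement vector and that a binary maximum entropy model, which has the same form as logistic regression, is fitted to labelled pairs. The field names, toy training data, learning rate, epoch count, and the 0.5 cut-off are all hypothetical choices made for illustration.

```python
import math

# Hypothetical key fields compared between the two files.
FIELDS = ["name", "dob", "zip"]

def agreement_vector(rec_a, rec_b, fields=FIELDS):
    """Binary comparison vector: 1 where the field values agree, else 0."""
    return [1 if rec_a[f] == rec_b[f] else 0 for f in fields]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_maxent(X, y, lr=0.5, epochs=2000):
    """Fit a binary maximum entropy (logistic) model by batch gradient ascent
    on the average log-likelihood. Returns (weights, bias)."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = yi - p
            for j in range(k):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj + lr * g / n for wj, g in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b

def match_probability(w, b, x):
    """Estimated probability that a pair with comparison vector x is a match."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy labelled pairs (assumed): 1 = same entity, 0 = different entities.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
     [0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_maxent(X, y)

# Score one candidate pair drawn from two hypothetical files.
rec_a = {"name": "ann lee", "dob": "1980-01-02", "zip": "50010"}
rec_b = {"name": "ann lee", "dob": "1980-01-02", "zip": "50011"}
p = match_probability(w, b, agreement_vector(rec_a, rec_b))
is_link = p >= 0.5  # illustrative cut-off, not the paper's selection rule
```

In this sketch every pair receives a decision from the fitted probability alone, so no pair is set aside for manual inspection. The abstract's rule for choosing the link set according to its associated uncertainty, and the unsupervised setting, are not reproduced here.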
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Data Mining Algorithms and Applications
