Record Linkage to Match Customer Names: A Probabilistic Approach

Bahare Fatemi; Seyed Mehran Kazemi; David Poole

arXiv:1806.10928 · cs.DB · June 29, 2018

TL;DR

This paper introduces a probabilistic relational logistic regression model for record linkage of customer names, effectively handling variations, typos, and abbreviations, and demonstrating strong performance on real-world and unseen datasets.

Contribution

The paper presents a novel probabilistic model for record linkage that outperforms existing baselines and can transfer knowledge across domains.

Findings

01 Model achieves high accuracy on real-world data.
02 Effective transferability to new datasets.
03 Robustness to dataset statistical variations.

Abstract

Consider the following problem: given a database of records indexed by names (e.g., names of companies, restaurants, businesses, or universities) and a new name, determine whether the new name is in the database and, if so, which record it refers to. This is an instance of the record linkage problem. It is challenging because people do not consistently use the official name: they use abbreviations, synonyms, different term orders, alternative spellings, and short forms, and the name may contain typos or spacing issues. We provide a probabilistic model using relational logistic regression to find the probability of each record in the database being the desired record for a given query and find the best record(s) with respect to the probabilities. Building on term-matching and translational approaches for search, our model addresses many of the aforementioned…
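
To make the scoring idea in the abstract concrete, here is a minimal sketch in Python. It is not the paper's relational logistic regression model. It is a simplified stand-in that passes a weighted sum of hand-written name-matching features through a sigmoid to get a match probability for each record, then ranks the records. The feature set (exact term overlap, prefix-style abbreviations, character-trigram overlap), the function names, and the weights are all assumptions made for illustration; a real model would learn its weights from labeled query/record pairs.

```python
import math

# Hypothetical hand-set weights (an assumption for illustration, not from the paper).
# A trained model would fit these to labeled query/record pairs.
WEIGHTS = {"bias": -4.0, "exact": 3.0, "prefix": 2.0, "trigram": 4.0}


def trigrams(text):
    """Set of character trigrams, ignoring case and spaces (tolerates typos and spacing issues)."""
    s = text.lower().replace(" ", "")
    return {s[i:i + 3] for i in range(len(s) - 2)}


def features(query, name):
    """Illustrative features describing how well a query matches a record name."""
    q_terms = query.lower().split()
    n_terms = name.lower().split()
    denom = max(len(q_terms), 1)  # guard against an empty query
    # Fraction of query terms that appear verbatim in the name (order-insensitive).
    exact = sum(t in n_terms for t in q_terms) / denom
    # Fraction of query terms that are a strict prefix of some name term (e.g. "univ").
    prefix = sum(
        any(n.startswith(t) and n != t for n in n_terms) for t in q_terms
    ) / denom
    # Jaccard overlap of character trigrams.
    tq, tn = trigrams(query), trigrams(name)
    union = tq | tn
    trigram = len(tq & tn) / len(union) if union else 0.0
    return {"exact": exact, "prefix": prefix, "trigram": trigram}


def match_probability(query, name):
    """Sigmoid of the weighted feature sum: a probability-like match score."""
    f = features(query, name)
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in f.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_records(query, names):
    """Return (probability, name) pairs, best match first."""
    return sorted(((match_probability(query, n), n) for n in names), reverse=True)


if __name__ == "__main__":
    database = [
        "University of British Columbia",
        "Simon Fraser University",
        "British Airways",
    ]
    for prob, name in rank_records("Univ British Columbia", database):
        print(f"{prob:.3f}  {name}")
```

With these toy weights, the abbreviated query "Univ British Columbia" ranks "University of British Columbia" first. Deciding whether the query is in the database at all would additionally need a threshold on the top probability, which this sketch leaves out.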

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Semantic Web and Ontologies