Robust Group Linkage

Pei Li; Xin Luna Dong; Songtao Guo; Andrea Maurino; Divesh Srivastava

arXiv:1503.00604·cs.DB·March 3, 2015

TL;DR

This paper introduces a two-stage algorithm for group linkage that tolerates value diversity and erroneous values, scales to large data, and is reported to be accurate and efficient on real-world datasets.

Contribution

The paper proposes a novel two-stage group linkage algorithm that improves robustness to errors and value diversity, and scales efficiently to large datasets.

Findings

1. High accuracy in linking groups in real-world data
2. Effective handling of erroneous and diverse attribute values
3. Scalable performance on large datasets

Abstract

We study the problem of group linkage: linking records that refer to entities in the same group. Applications for group linkage include finding businesses in the same chain, finding conference attendees from the same affiliation, and finding players from the same team. Group linkage faces challenges not present for traditional record linkage. First, although different members in the same group can share some similar global values of an attribute, they represent different entities and so can also have distinct local values for the same or different attributes, requiring a high tolerance for value diversity. Second, groups can be huge (with tens of thousands of records), requiring high scalability even after using good blocking strategies. We present a two-stage algorithm: the first stage identifies cores containing records that are very likely to belong to the same group, while being…
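The notion of a core can be made concrete with a toy sketch. It is not the paper's algorithm, and it covers only the first-stage idea quoted above. Everything in it is an assumption made for illustration: the record fields (`name`, `phone`, `url`), the sample data, the hypothetical function `find_cores`, and the linking rule itself, which connects two records only when they agree on at least `min_shared` attributes so that a single shared or mistyped value is not enough to form a core.

```python
from itertools import combinations


def find_cores(records, attrs, min_shared=2):
    """Toy stand-in for core identification (not the paper's method).

    Two records are linked when they agree on at least `min_shared` of the
    attributes in `attrs`. Cores are the connected components of size >= 2.
    """
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pairwise comparison is quadratic; acceptable only for a small example.
    for a, b in combinations(records, 2):
        shared = sum(
            1 for k in attrs
            if a.get(k) is not None and a.get(k) == b.get(k)
        )
        if shared >= min_shared:
            parent[find(a["id"])] = find(b["id"])

    components = {}
    for r in records:
        components.setdefault(find(r["id"]), set()).add(r["id"])
    return sorted((c for c in components.values() if len(c) >= 2), key=min)


# Hypothetical business listings; all values are made up.
records = [
    {"id": 1, "name": "Taco Casa", "phone": "555-0100", "url": "tacocasa.example"},
    {"id": 2, "name": "Taco Casa", "phone": "555-0101", "url": "tacocasa.example"},
    {"id": 3, "name": "Taco Casa", "phone": "555-0100", "url": None},
    {"id": 4, "name": "Taco Casa", "phone": "555-0199", "url": "other.example"},
    {"id": 5, "name": "Elmwood Diner", "phone": "555-0142", "url": "elmwood.example"},
]

cores = find_cores(records, ["name", "phone", "url"])
print(cores)
```

In this sample, records 1, 2, and 3 end up in one core because each is tied to record 1 by two agreeing attributes. Record 4 shares only the name with the others and is not placed in any core. The pairwise loop would not meet the scalability requirement the abstract describes; it is kept only for readability.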

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Data-Driven Disease Surveillance