Fast Learning from Distributed Datasets without Entity Matching
Giorgio Patrini, Richard Nock, Stephen Hardy, Tiberio Caetano

TL;DR
This paper introduces an end-to-end method for learning classifiers from distributed, anonymized datasets without entity matching, using Rademacher observations to improve efficiency and handle data partitioning.
Contribution
It proposes a novel approach that bypasses entity resolution by leveraging Rademacher observations, reducing computational complexity and enabling learning in more general data partitioning scenarios.
Findings
The method avoids explicit entity matching, simplifying the data fusion process.
It achieves lower time and space complexity than the traditional pipeline, which first matches entities and then learns the classifier.
Experiments show it can outperform the best peer in certain settings.
Abstract
Consider the following data fusion scenario: two datasets/peers contain the same real-world entities described using partially shared features, e.g. banking and insurance company records of the same customer base. Our goal is to learn a classifier in the cross product space of the two domains, in the hard case in which no shared ID is available, e.g. due to anonymization. Traditionally, the problem is approached by first addressing entity matching and subsequently learning the classifier in a standard manner. We present an end-to-end solution which bypasses matching entities, based on the recently introduced concept of Rademacher observations (rados). Informally, we replace the minimisation of a loss over examples, which requires solving entity resolution, by the equivalent minimisation of a (different) loss over rados. Among others, key properties we show are (i) a potentially huge…
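The loss equivalence mentioned in the abstract can be illustrated with a small numerical check. The sketch below is not taken from this page: it assumes the usual definition of a Rademacher observation from the earlier rado literature, pi_sigma = (1/2) * sum_i (sigma_i + y_i) * x_i for sigma in {-1, +1}^m, and the identity that the average logistic loss over examples equals log(2) + (1/m) * log of the average exponential loss over all 2^m rados. The function names (`rado`, `logistic_loss`, `exp_rado_loss`, `check_identity`) are hypothetical, and the code works on a single toy dataset, so it does not cover the distributed, two-peer setting the paper addresses.

```python
import itertools
import math
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def rado(sigma, X, y):
    """Rademacher observation for sign vector sigma (assumed definition).

    pi_sigma = 1/2 * sum_i (sigma_i + y_i) * x_i, i.e. the sum of
    y_i * x_i over the examples whose label agrees with sigma_i.
    """
    d = len(X[0])
    pi = [0.0] * d
    for s, yi, xi in zip(sigma, y, X):
        w = 0.5 * (s + yi)  # equals y_i if s == y_i, else 0
        for k in range(d):
            pi[k] += w * xi[k]
    return pi


def logistic_loss(theta, X, y):
    """Average logistic loss over the m examples."""
    m = len(X)
    return sum(math.log1p(math.exp(-yi * dot(theta, xi)))
               for xi, yi in zip(X, y)) / m


def exp_rado_loss(theta, X, y):
    """Average exponential loss over all 2^m rados (toy sizes only)."""
    m = len(X)
    total = sum(math.exp(-dot(theta, rado(sigma, X, y)))
                for sigma in itertools.product((-1, 1), repeat=m))
    return total / 2 ** m


def check_identity(seed=0, m=6, d=3):
    """Return both sides of the identity on a random toy dataset."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(m)]
    y = [rng.choice((-1, 1)) for _ in range(m)]
    theta = [rng.uniform(-1, 1) for _ in range(d)]
    lhs = logistic_loss(theta, X, y)
    rhs = math.log(2) + math.log(exp_rado_loss(theta, X, y)) / m
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = check_identity()
    print(f"logistic loss over examples: {lhs:.12f}")
    print(f"log-exp loss over rados:     {rhs:.12f}")
```

Enumerating all 2^m rados is only feasible for tiny m. The sketch does so purely to show that the two losses agree exactly; a practical method would work with a much smaller sample of rados.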
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Advanced Database Systems and Queries
