Record fusion: A learning approach

Alireza Heidari; George Michalopoulos; Shrinu Kushagra; Ihab F. Ilyas; Theodoros Rekatsinas

arXiv:2006.10208 · cs.LG · June 19, 2020 · 6 cites

Open Access

TL;DR

This paper treats record fusion as a machine learning problem: a novel stagewise additive model merges database records that represent the same entity, reaching roughly 98% precision with source information and roughly 94% without it.

Contribution

The paper presents a new stagewise additive learning model for record fusion, combining multiple signals and deep transformations to improve accuracy over existing methods.

Findings

01. Achieves ~98% precision with source info

02. Achieves ~94% precision without source info

03. Outperforms existing data fusion methods by 20-45%

Abstract

Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value for each attribute of each entity. Given a database, we use a combination of attribute-level, record-level, and database-level signals to construct a feature vector for each cell (that is, each (row, col) pair) of that database. We use this feature vector along with the ground-truth information to learn a classifier for each of the attributes of the database. Our learning algorithm uses a novel stagewise additive model. At each stage, we construct a new feature vector by combining a part of the original feature vector with features computed from the predictions of the previous stage. We then learn a softmax classifier over the new feature space. This greedy stagewise approach can…
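
The stagewise idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how features are derived from the previous stage's predictions, so the sketch assumes they are simply the previous stage's predicted class probabilities. The names `fit_stagewise`, `predict_stagewise`, and `keep` (the indices of the retained slice of the original features) are hypothetical, and the toy data is synthetic rather than built from attribute-level, record-level, or database-level signals.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, x):
    """Class probabilities for one feature vector; the last weight is the bias."""
    xb = x + [1.0]
    return softmax([sum(w * v for w, v in zip(row, xb)) for row in W])


def train_softmax(X, y, n_classes, epochs=400, lr=0.2):
    """Fit a softmax classifier with plain batch gradient descent."""
    dim = len(X[0]) + 1
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * dim for _ in range(n_classes)]
        for x, label in zip(X, y):
            p = predict_proba(W, x)
            xb = x + [1.0]
            for k in range(n_classes):
                err = p[k] - (1.0 if k == label else 0.0)
                for j, v in enumerate(xb):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(dim):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W


def fit_stagewise(X, y, n_classes, n_stages=3, keep=None):
    """Greedy stagewise fit (assumed form): each stage is trained on a slice of
    the original features plus the previous stage's predicted probabilities."""
    keep = list(range(len(X[0]))) if keep is None else keep
    prev = [[1.0 / n_classes] * n_classes for _ in X]  # uninformative start
    stages = []
    for _ in range(n_stages):
        Z = [[x[j] for j in keep] + p for x, p in zip(X, prev)]
        W = train_softmax(Z, y, n_classes)
        stages.append(W)
        prev = [predict_proba(W, z) for z in Z]
    return stages, keep


def predict_stagewise(stages, keep, x, n_classes):
    """Run one example through every stage and return the final class index."""
    p = [1.0 / n_classes] * n_classes
    for W in stages:
        p = predict_proba(W, [x[j] for j in keep] + p)
    return max(range(n_classes), key=lambda k: p[k])


# Synthetic toy data: three well-separated clusters, one per class.
random.seed(0)
centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
X, y = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(20):
        X.append([random.gauss(cx, 0.3), random.gauss(cy, 0.3)])
        y.append(label)

stages, keep = fit_stagewise(X, y, n_classes=3, n_stages=3)
predictions = [predict_stagewise(stages, keep, x, 3) for x in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"training accuracy after {len(stages)} stages: {accuracy:.2f}")
```

In a record fusion setting, one such classifier would be trained per attribute, with the classes standing for the candidate values of a cell. Passing a shorter `keep` list restricts each stage to part of the original feature vector, which mirrors the "a part of the original feature vector" wording in the abstract.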

Taxonomy

Topics: Data Quality and Management · Anomaly Detection Techniques and Applications · Data-Driven Disease Surveillance

Methods: Softmax