Consistent data fusion with Parker

Antoon Bronselaer; Maribel Acosta

arXiv:2202.12184 · cs.DB · March 29, 2024 · 1 citation


TL;DR

This paper introduces Parker, an efficient data repair engine that uses new constraints called edit rules under a partial key to quickly and effectively resolve inconsistencies in multi-source data fusion.

Contribution

The paper proposes a novel constraint model (EPKs) and adapts set cover algorithms to create Parker, significantly improving repair speed while maintaining or enhancing repair quality.

Findings

01 Parker is several orders of magnitude faster than state-of-the-art repair tools.

02 The F1-score of Parker's repairs ranges from comparable to better than that of these tools.

03 EPKs model inconsistencies both within and between sources.

Abstract

When combining data from multiple sources, inconsistent data complicates the production of a coherent result. In this paper, we introduce a new type of constraint called edit rules under a partial key (EPKs). These constraints can model inconsistencies both within and between sources, but in a loosely coupled manner. We show that the well-known set cover methodology can be adapted to the setting of EPKs, which yields an efficient algorithm for finding minimal-cost repairs of sources. This algorithm is implemented in a repair engine called Parker. Empirical results show that Parker is several orders of magnitude faster than state-of-the-art repair tools. At the same time, the quality of the repairs in terms of $F_{1}$-score ranges from comparable to better than that of these tools.
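The set cover idea mentioned in the abstract can be illustrated with a small sketch. This is not Parker's algorithm and does not model EPKs. It only shows the general minimum-cost cover formulation, under the simplifying assumption that each violated rule is represented by the set of attributes it involves and that a repair must change at least one attribute of every violated rule. The function name `min_cost_cover`, the attribute names, and the costs are all hypothetical.

```python
from itertools import combinations


def min_cost_cover(violated_rules, cost):
    """Return (attributes, total_cost) for the cheapest attribute set that
    touches every violated rule.

    violated_rules: list of sets of attribute names (hypothetical encoding).
    cost: dict mapping each attribute to the cost of changing it.
    Exhaustive search, so only suitable for a handful of attributes.
    """
    if not violated_rules:
        return frozenset(), 0

    attrs = sorted(set().union(*violated_rules))
    best, best_cost = frozenset(), float("inf")

    # Try every non-empty subset of the attributes that occur in some rule.
    for k in range(1, len(attrs) + 1):
        for combo in combinations(attrs, k):
            chosen = set(combo)
            # A candidate is a cover if it intersects every violated rule.
            if all(chosen & rule for rule in violated_rules):
                total = sum(cost[a] for a in chosen)
                if total < best_cost:
                    best, best_cost = frozenset(chosen), total
    return best, best_cost


# Two violated rules share the attribute "age" (illustrative data only).
rules = [{"age", "marital_status"}, {"age", "employment"}]

# When "age" is expensive to change, changing the two other attributes is cheaper.
expensive_age = {"age": 3, "marital_status": 1, "employment": 1}
print(min_cost_cover(rules, expensive_age))

# When "age" is cheap, changing it alone covers both rules.
cheap_age = {"age": 1, "marital_status": 1, "employment": 1}
print(min_cost_cover(rules, cheap_age))
```

The exhaustive search is exponential in the number of attributes and serves only to make the objective concrete. The abstract states that Parker uses an efficient adaptation of the set cover methodology, the details of which are not given here.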


Taxonomy

Topics: Data Quality and Management · Software Testing and Debugging Techniques · Software Reliability and Analysis Research