Eris: Measuring discord among multidimensional data sources

Alberto Abello; James Cheney

arXiv:2201.13302 · cs.DB · September 21, 2023

TL;DR

This paper introduces a metric called Eris for quantifying disagreement among multidimensional data sources, especially when ground truth is unavailable, to improve data trustworthiness and decision-making.

Contribution

It defines the concept of data concordance and proposes a discordance metric along with algebraic operators and database implementations for efficient measurement.

Findings

01. Efficient discordance measurement on COVID-19 and synthetic data

02. Algebraic operators with correctness guarantees for data alignment

03. Linear and quadratic programming approaches for implementation

Abstract

Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and data fusion. To solve the latter, it is mostly assumed that ground truth can be determined. However, in general, the data gathering processes in the different sources are imperfect and cannot provide an accurate merging of values. Thus, in the absence of ways to determine ground truth, it is important to at least quantify how far from being internally consistent a dataset is. Hence, we propose definitions of concordant data and define a discordance metric as a way of measuring disagreement to improve decision making based on trustworthiness. We define the discord measurement problem of numerical attributes in which given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the…

Code & Models

Repositories

dtim-upc/eris (official)

Taxonomy

Topics: Data Quality and Management · Semantic Web and Ontologies · Biomedical Text Mining and Ontologies