Eris: Measuring discord among multidimensional data sources
Alberto Abello, James Cheney

TL;DR
This paper introduces a metric called Eris for quantifying disagreement among multidimensional data sources, especially when ground truth is unavailable, to improve data trustworthiness and decision-making.
Contribution
It defines the concept of data concordance and proposes a discordance metric along with algebraic operators and database implementations for efficient measurement.
Findings
Efficient discordance measurement on COVID-19 and synthetic data
Algebraic operators with correctness guarantees for data alignment
Linear and quadratic programming approaches for implementation
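To give the quadratic-programming finding some shape, here is a minimal sketch of one way disagreement between an aggregate and its parts could be scored. It is not the paper's definition of Eris: the function name `min_discord`, the choice of an unweighted sum of squared adjustments, and the sample figures are all assumptions made for illustration. The problem is to find the smallest adjustments that make the regional counts add up to the reported total, which for a single sum constraint has a closed-form solution.

```python
def min_discord(parts, total):
    """Smallest sum of squared adjustments that reconciles parts with total.

    Hypothetical toy problem (not taken from the paper):
        minimise   sum(e_i ** 2) + e_t ** 2
        subject to sum(parts_i + e_i) == total + e_t
    With one linear constraint the minimum-norm solution spreads the
    residual evenly over all n + 1 reported values.
    """
    n = len(parts)
    residual = total - sum(parts)      # how far the parts are from the total
    share = residual / (n + 1)         # equal share absorbed by each value
    part_errors = [share] * n          # adjustment added to each part
    total_error = -share               # adjustment added to the total
    discord = residual ** 2 / (n + 1)  # value of the objective at the optimum
    return discord, part_errors, total_error


# Illustrative numbers only: three regional counts and a national total.
discord, part_errors, total_error = min_discord([100, 250, 150], 520)
print(discord, part_errors, total_error)
```

A score of zero means the parts already agree with the total; larger values mean larger corrections are needed. Realistic settings with several overlapping aggregates, weights, or absolute-value objectives would generally need a linear or quadratic programming solver rather than a closed form.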
Abstract
Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and data fusion. To solve the latter, it is mostly assumed that ground truth can be determined. However, in general, the data gathering processes in the different sources are imperfect and cannot provide an accurate merging of values. Thus, in the absence of ways to determine ground truth, it is important to at least quantify how far from being internally consistent a dataset is. Hence, we propose definitions of concordant data and define a discordance metric as a way of measuring disagreement to improve decision making based on trustworthiness. We define the discord measurement problem of numerical attributes in which given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the…
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies · Biomedical Text Mining and Ontologies
