Assessing and Remedying Coverage for a Given Dataset

Abolfazl Asudeh; Zhongjun Jin; H. V. Jagadish

arXiv:1810.06742 · cs.DB · April 27, 2023 · 1 cites

TL;DR

This paper presents methods to evaluate and improve dataset coverage across multiple attributes, aiming to reduce bias and vulnerabilities in data-driven decision-making.

Contribution

It introduces efficient techniques for identifying coverage gaps and estimating the minimal additional data needed to address them.

Findings

01. Effective identification of coverage gaps in datasets.

02. Quantitative methods to determine minimal data augmentation.

03. Validated approaches through experiments on real datasets.

Abstract

Data analysis impacts virtually every aspect of our society today. Often, this analysis is performed on an existing dataset, possibly collected through a process that the data scientists had limited control over. The existing data analyzed may not include the complete universe, but it is expected to cover the diversity of items in the universe. Lack of adequate coverage in the dataset can result in undesirable outcomes such as biased decisions and algorithmic racism, as well as creating vulnerabilities such as opening up room for adversarial attacks. In this paper, we assess the coverage of a given dataset over multiple categorical attributes. We first provide efficient techniques for traversing the combinatorial explosion of value combinations to identify any regions of attribute space not adequately covered by the data. Then, we determine the least amount of additional data that…
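
To make the notion of an inadequately covered region concrete, the following is a minimal brute-force sketch. It is not the paper's method, which the abstract describes as an efficient traversal of the value combinations. The function name `uncovered_patterns`, the `threshold` parameter, and the sample attributes are all assumptions introduced for illustration: a value combination is treated as a coverage gap when fewer than `threshold` rows match it.

```python
from collections import Counter
from itertools import combinations, product


def uncovered_patterns(rows, domains, threshold):
    """Return value combinations matched by fewer than `threshold` rows.

    rows: list of dicts mapping attribute name -> categorical value
    domains: dict mapping attribute name -> list of possible values
    threshold: hypothetical minimum row count for "adequate" coverage
    """
    attrs = sorted(domains)
    gaps = []
    # Naive enumeration over every attribute subset and every value
    # combination; the cost grows exponentially with the attribute count.
    for k in range(1, len(attrs) + 1):
        for subset in combinations(attrs, k):
            counts = Counter(tuple(r[a] for a in subset) for r in rows)
            for values in product(*(domains[a] for a in subset)):
                if counts[values] < threshold:
                    gaps.append(dict(zip(subset, values)))
    return gaps


# Hypothetical toy dataset with two categorical attributes.
rows = [
    {"gender": "F", "race": "A"},
    {"gender": "F", "race": "B"},
    {"gender": "M", "race": "A"},
]
domains = {"gender": ["F", "M"], "race": ["A", "B"]}

print(uncovered_patterns(rows, domains, threshold=1))
```

With a threshold of 1, every single-attribute value appears at least once, so the only reported gap is the combination of `gender` = `M` and `race` = `B`. The exhaustive enumeration is the cost that motivates a more efficient traversal.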

Taxonomy

Topics: Data Mining Algorithms and Applications · Data Management and Algorithms · Bayesian Modeling and Causal Inference