Auditing for Diversity using Representative Examples
Vijay Keswani, L. Elisa Celis

TL;DR
This paper introduces a cost-effective method for estimating dataset diversity related to protected attributes by leveraging a small labeled control set and similarity measures, reducing the need for extensive labeling.
Contribution
It proposes a novel algorithm that uses a small labeled control set and similarity metrics to approximate dataset disparity, with theoretical guarantees and adaptive control set construction.
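The idea of estimating disparity from a labeled control set and pairwise similarities can be illustrated with a minimal sketch. This is not the authors' published algorithm; it assumes a binary protected attribute, feature embeddings compared by cosine similarity, and that elements of the same group are on average more similar than elements of different groups. The function names `mean_similarity` and `estimate_disparity` are hypothetical. Under those assumptions, the gap between the dataset's mean similarity to each control group, divided by the within-group versus across-group similarity gap measured on the control set, approximates the fraction of group 0 minus the fraction of group 1.

```python
import numpy as np


def mean_similarity(a, b, exclude_self=False):
    """Mean cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T
    if exclude_self:
        # a and b are the same set: drop the trivial self-pairs on the diagonal.
        n = sims.shape[0]
        return (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return sims.mean()


def estimate_disparity(data, control, control_labels):
    """Estimate (fraction of group 0) - (fraction of group 1) in `data`.

    Assumption for this sketch: labels are 0/1 and each control group
    holds at least two elements.
    """
    t0 = control[control_labels == 0]
    t1 = control[control_labels == 1]

    # Proxy: how much closer the dataset sits to group 0 than to group 1.
    proxy = mean_similarity(data, t0) - mean_similarity(data, t1)

    # Calibration measured on the labeled control set alone.
    within = 0.5 * (mean_similarity(t0, t0, exclude_self=True)
                    + mean_similarity(t1, t1, exclude_self=True))
    across = mean_similarity(t0, t1)

    return proxy / (within - across)
```

The rescaling step matters because the raw proxy depends on how well the similarity measure separates the groups. If the within-group and across-group similarities are nearly equal, the denominator approaches zero and the estimate becomes unreliable, which is one reason the choice of control set affects the approximation error.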
Findings
Effective approximation of dataset disparity with small control sets
Adaptive control sets outperform random selection in reducing approximation error
Demonstrated success on image and Twitter datasets
Abstract
Assessing the diversity of a dataset of information associated with people is crucial before using such data for downstream applications. For a given dataset, this often involves computing the imbalance or disparity in the empirical marginal distribution of a protected attribute (e.g., gender or dialect). However, real-world datasets, such as images from Google Search or collections of Twitter posts, often do not have protected attributes labeled. Consequently, to derive disparity measures for such datasets, the elements need to be hand-labeled or crowd-annotated, both of which are expensive processes. We propose a cost-effective approach to approximate the disparity of a given unlabeled dataset, with respect to a protected attribute, using a control set of labeled representative examples. Our proposed algorithm uses the pairwise similarity between elements in the dataset and elements in the…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Internet Traffic Analysis and Secure E-voting · Machine Learning and Algorithms
