Large image datasets: A pyrrhic win for computer vision?
Vinay Uday Prabhu, Abeba Birhane

TL;DR
This paper critically examines large-scale vision datasets such as ImageNet, documents ethical problems such as non-consensual content, and proposes measures such as Institutional Review Boards (IRBs) to improve dataset curation practices.
Contribution
It provides a detailed ethical analysis of ImageNet, including a census of problematic images, suggests corrective actions, and open-sources tools for the community.
Findings
Identification of verifiably pornographic images in ImageNet
Quantitative analysis of ethical transgressions in datasets
Recommendations for ethical dataset curation practices
Abstract
In this paper we investigate problematic practices and consequences of large scale vision datasets. We examine broad issues such as the question of consent and justice as well as specific concerns such as the inclusion of verifiably pornographic images in datasets. Taking the ImageNet-ILSVRC-2012 dataset as an example, we perform a cross-sectional model-based quantitative census covering factors such as age, gender, NSFW content scoring, class-wise accuracy, human-cardinality-analysis, and the semanticity of the image class information in order to statistically investigate the extent and subtleties of ethical transgressions. We then use the census to help hand-curate a look-up-table of images in the ImageNet-ILSVRC-2012 dataset that fall into the categories of verifiably pornographic: shot in a non-consensual setting (up-skirt), beach voyeuristic, and exposed private parts. We survey…
Taxonomy
Topics: Ethics and Social Impacts of AI · Face recognition and analysis · Sexuality, Behavior, and Technology
