The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models

Abeba Birhane; Sepehr Dehdashtian; Vinay Uday Prabhu; Vishnu Boddeti

arXiv:2405.04623·cs.CY·May 9, 2024

TL;DR

This study investigates how scaling multimodal training datasets affects racial and gender bias in vision-language models. It finds that bias increases in larger models trained on larger datasets, and it discusses mitigation strategies.

Contribution

The paper evaluates how bias changes across 14 visio-linguistic models (VLMs) as the training dataset grows, and shows that the effect of data scaling on fairness differs between smaller and larger models.

Findings

01

Bias towards criminal classification increases with dataset size in larger models.

02

Smaller models show decreased bias with dataset scaling.

03

Larger models exhibit more racially dehumanizing misclassifications.

Abstract

"Scale the model, scale the data, scale the GPU farms" is the reigning sentiment in the world of generative AI today. While model scaling has been extensively studied, data scaling and its downstream impacts on model performance remain under-explored. This is particularly important in the context of multimodal datasets whose main source is the World Wide Web, condensed and packaged as the Common Crawl dump, which is known to exhibit numerous drawbacks. In this paper, we evaluate the downstream impact of dataset scaling on 14 visio-linguistic models (VLMs) trained on the LAION-400M and LAION-2B datasets by measuring racial and gender bias using the Chicago Face Dataset (CFD) as the probe. Our results show that as the training data increased, the probability of a pre-trained CLIP model misclassifying human images as offensive non-human classes such as chimpanzee, gorilla, and orangutan…
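The probe described in the abstract can be read as a zero-shot classification audit: each face image is scored against a fixed set of class prompts, and the share of images assigned to an offensive class is compared across models trained on different dataset sizes. The sketch below illustrates that computation under stated assumptions; it is not the authors' code. It assumes the standard CLIP zero-shot setup (softmax over scaled cosine similarities), uses random arrays as stand-ins for real image and text embeddings, and the names `zero_shot_probs`, `flagged_rate`, the placeholder class list, and the `logit_scale` value of 100 are all hypothetical choices made for illustration.

```python
import numpy as np


def zero_shot_probs(image_emb, text_emb, logit_scale=100.0):
    """Return per-image class probabilities from embedding similarities.

    image_emb: (n_images, d) array, text_emb: (n_classes, d) array.
    logit_scale is an assumed temperature, not a value from the paper.
    """
    img = np.asarray(image_emb, dtype=float)
    txt = np.asarray(text_emb, dtype=float)
    # L2-normalise so the dot product is a cosine similarity.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = logit_scale * (img @ txt.T)
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def flagged_rate(probs, flagged_idx):
    """Fraction of images whose top-1 class is in the flagged set."""
    top1 = probs.argmax(axis=1)
    return float(np.isin(top1, flagged_idx).mean())


if __name__ == "__main__":
    # Placeholder class list; index 0 is the expected (benign) label and
    # the remaining indices stand for the classes an audit would flag.
    classes = ["expected_label", "flagged_class_a", "flagged_class_b"]
    flagged = [1, 2]

    rng = np.random.default_rng(0)
    text_emb = rng.normal(size=(len(classes), 16))
    # Random stand-ins for probe-image embeddings from two hypothetical
    # checkpoints of the same architecture trained on different data sizes.
    emb_small_data = rng.normal(size=(200, 16))
    emb_large_data = rng.normal(size=(200, 16))

    for name, emb in [("small-data", emb_small_data),
                      ("large-data", emb_large_data)]:
        rate = flagged_rate(zero_shot_probs(emb, text_emb), flagged)
        print(f"{name}: flagged top-1 rate = {rate:.3f}")
```

With real models, the random arrays would be replaced by embeddings of the probe images and of the class prompts from each checkpoint, and the rate would typically be reported per demographic group rather than over the whole probe set.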

Code & Models

Repositories

SepehrDehdashtian/the-dark-side-of-dataset-scaling (PyTorch, official)
