FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Eric Slyman; Stefan Lee; Scott Cohen; Kushal Kafle

arXiv:2404.16123·cs.CV·April 26, 2024

TL;DR

FairDeDup modifies the SemDeDup deduplication algorithm so that vision-language models trained on the pruned data show improved fairness metrics while maintaining performance.

Contribution

The paper evaluates how semantic deduplication affects social biases in trained vision-language models and introduces FairDeDup, an easy-to-implement modification to the SemDeDup algorithm that reduces the negative effects observed without sacrificing model performance.

Findings

01 FairDeDup reduces social biases in models trained on the deduplicated data.

02 FairDeDup maintains zero-shot performance on benchmarks.

03 FairDeDup outperforms SemDeDup on fairness metrics.

Abstract

Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over…
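To make the idea of content-aware pruning concrete, the following is a minimal sketch of embedding-based deduplication with two ways of choosing which sample to keep from each group of near-duplicates. It is not the SemDeDup or FairDeDup algorithm as published: the function names, the greedy grouping, the 0.9 similarity threshold, and the per-sample attribute labels are all assumptions made for illustration only.

```python
import numpy as np


def duplicate_groups(embeddings, threshold=0.9):
    """Greedily group samples whose cosine similarity to a seed is >= threshold.

    Hypothetical helper; the threshold value is an arbitrary assumption.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # pairwise cosine similarities
    unassigned = list(range(len(normed)))
    groups = []
    while unassigned:
        seed = unassigned[0]
        group = [i for i in unassigned if sims[seed, i] >= threshold]
        groups.append(group)
        unassigned = [i for i in unassigned if i not in group]
    return groups


def keep_first(groups):
    """Baseline: keep the first sample of every duplicate group."""
    return [group[0] for group in groups]


def keep_balanced(groups, labels):
    """Illustrative fairness-aware choice (an assumption, not the paper's rule):
    keep the sample whose label is least represented among samples kept so far.
    """
    counts = {}
    kept = []
    for group in groups:
        choice = min(group, key=lambda i: counts.get(labels[i], 0))
        counts[labels[choice]] = counts.get(labels[choice], 0) + 1
        kept.append(choice)
    return kept


# Toy data: two pairs of near-duplicate embeddings with made-up labels.
embeddings = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
labels = ["a", "b", "a", "b"]

groups = duplicate_groups(embeddings)
baseline_kept = keep_first(groups)
balanced_kept = keep_balanced(groups, labels)
```

On this toy input both strategies prune the dataset to the same size, but the baseline keeps only samples labelled "a", while the balanced choice keeps one sample of each label. This is the kind of disparity in what survives pruning that the abstract is concerned with.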

Taxonomy

Topics: Legal and Policy Issues

Methods: Dataset Pruning · Pruning · Contrastive Language-Image Pre-training