Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp
Rachel Hong, William Agnew, Tadayoshi Kohno, and Jamie Morgenstern

TL;DR
This paper critically examines how CLIP-based filtering in data collection introduces biases, excludes marginalized groups, and fails to effectively filter explicit or copyrighted content, highlighting the need for improved dataset curation practices.
Contribution
It provides an empirical analysis of CLIP-filtering biases in DataComp, revealing exclusion of marginalized groups and shortcomings in content filtering methods.
Findings
CLIP-filtering disproportionately excludes data related to marginalized groups.
Filtering amplifies existing underrepresentation of certain demographics.
The NSFW filter fails to remove explicit content effectively.
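To make the first two findings concrete, the following is a minimal sketch of how one might measure whether a score-based filter retains different groups at different rates. It is not the authors' code. The scores, group labels, keep fraction, and the function names `clip_filter` and `inclusion_rates` are all hypothetical, and the toy numbers stand in for CLIP image-text similarity scores that would come from a real model.

```python
from collections import defaultdict


def clip_filter(scores, keep_fraction):
    """Return the indices of samples kept when retaining the top-scoring fraction."""
    n_keep = round(len(scores) * keep_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_keep])


def inclusion_rates(scores, groups, keep_fraction):
    """Compute, for each group label, the share of its samples that survive filtering."""
    kept = clip_filter(scores, keep_fraction)
    total = defaultdict(int)
    retained = defaultdict(int)
    for i, group in enumerate(groups):
        total[group] += 1
        if i in kept:
            retained[group] += 1
    return {group: retained[group] / total[group] for group in total}


# Toy data (assumption): one similarity score and one annotated group per sample.
scores = [0.35, 0.31, 0.30, 0.28, 0.22, 0.21, 0.19, 0.18, 0.15, 0.12]
groups = ["a", "a", "a", "b", "a", "b", "a", "b", "b", "b"]

# Keep fraction of 0.5 is chosen only for illustration.
rates = inclusion_rates(scores, groups, keep_fraction=0.5)
print(rates)
```

In this toy setup both groups make up half of the unfiltered pool, but group "a" is retained at a much higher rate than group "b", so the filtered set under-represents "b" relative to the pool. A gap of that kind is what an audit of filtering discrepancies would look for.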
Abstract
As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several…
Taxonomy
Topics: Natural Language Processing Techniques
