Multimodal datasets: misogyny, pornography, and malignant stereotypes
Abeba Birhane, Vinay Uday Prabhu, Emmanuel Kahembwe

TL;DR
This paper critically examines the LAION-400M dataset and shows that it contains explicit and harmful content, including misogyny, pornography, and malignant stereotypes. These findings raise concerns about the safety and ethics of the large-scale datasets used to train AI models.
Contribution
It provides a detailed analysis of problematic content in LAION-400M, highlighting the need for better curation and oversight of large datasets for AI training.
Findings
LAION-400M contains explicit and harmful images and text.
The dataset includes racist, sexist, and stereotypical content.
The paper discusses the dataset's implications for downstream harms and the ethical concerns they raise.
Abstract
We have now entered the era of trillion parameter machine learning models trained on billion-sized datasets scraped from the internet. The rise of these gargantuan datasets has given rise to formidable bodies of critical work that has called for caution while generating these large datasets. These address concerns surrounding the dubious curation practices used to generate these datasets, the sordid quality of alt-text data available on the world wide web, the problematic content of the CommonCrawl dataset often used as a source for training large language models, and the entrenched biases in large-scale visio-linguistic models (such as OpenAI's CLIP model) trained on opaque datasets (WebImageText). In the backdrop of these specific calls of caution, we examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl…
Taxonomy
Topics: Gender, Feminism, and Media · Sexuality, Behavior, and Technology · Multimodal Machine Learning Applications
