Evaluating Concept Filtering Defenses against Child Sexual Abuse Material Generation by Text-to-Image Models
Ana-Maria Cretu, Klim Kireev, Amro Abdalla, Wisdom Obinna, Raphael Meier, Sarah Adel Bargal, Elissa M. Redmiles, Carmela Troncoso

TL;DR
This paper assesses how well concept filtering defenses for text-to-image models prevent the generation of child sexual abuse material, and it identifies both the limitations of current filtering and the difficulty of evaluating such defenses.
Contribution
It demonstrates that existing filtering methods are insufficient to protect open-weight models from misuse and discusses the complexities of evaluating filtering defenses.
Findings
Current detection methods cannot remove all child images from datasets.
Filtering reduces model generality and can be bypassed with prompting strategies.
Filtering offers limited protection for open-weight models against CSAM generation.
Abstract
We evaluate whether filtering child images from the training datasets of text-to-image models prevents misuse of these models to create child sexual abuse material (CSAM). First, we capture the complexity of preventing CSAM generation using a game-based security definition. Second, we show that current detection methods cannot remove all child images from a dataset. Third, using an ethical proxy for CSAM (a child wearing glasses), we show that even when only a small percentage of child images are left in the training dataset after filtering, there exist prompting strategies that generate a child wearing glasses using only a few more queries than when the model is trained on the unfiltered data. Fine-tuning the filtered model on child images further reduces the additional query overhead. We also show that re-introducing a concept is possible via fine-tuning even if filtering is perfect. Our…
