Safety and Fairness for Content Moderation in Generative Models
Susan Hao, Piyush Kumar, Sarah Laszlo, Shivani Poddar, Bhaktipriya Radharapu, Renee Shelby

TL;DR
This paper develops a theoretical framework and empirical methods for responsible content moderation in generative AI, focusing on safety, fairness, and harm quantification to enable data-driven moderation strategies.
Contribution
It introduces a novel framework for conceptualizing and measuring safety, fairness, and harms in text-to-image generative models, advancing responsible deployment practices.
Findings
Defined and distinguished safety, fairness, and metric equity concepts.
Demonstrated empirical measurement of harms in generative models.
Showcased how harm quantification supports data-driven moderation.
Abstract
With significant advances in generative AI, new technologies are rapidly being deployed with generative components. Generative models are typically trained on large datasets, resulting in model behaviors that can mimic the worst of the content in the training data. Responsible deployment of generative technologies requires content moderation strategies, such as safety input and output filters. Here, we provide a theoretical framework for conceptualizing responsible content moderation of text-to-image generative technologies, including a demonstration of how to empirically measure the constructs we enumerate. We define and distinguish the concepts of safety, fairness, and metric equity, and enumerate example harms that can come in each domain. We then provide a demonstration of how the defined harms can be quantified. We conclude with a summary of how the style of harms quantification we…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
