Safety and Fairness for Content Moderation in Generative Models

Susan Hao; Piyush Kumar; Sarah Laszlo; Shivani Poddar; Bhaktipriya Radharapu; Renee Shelby

arXiv:2306.06135 · cs.LG · June 13, 2023 · 5 cites

Open Access

TL;DR

This paper presents a theoretical framework for responsible content moderation of text-to-image generative models, together with a demonstration of how to measure safety, fairness, and metric-equity harms empirically so that moderation decisions can be data-driven.

Contribution

It introduces a novel framework for conceptualizing and measuring safety, fairness, and harms in text-to-image generative models, advancing responsible deployment practices.

Findings

01. Defined and distinguished the concepts of safety, fairness, and metric equity.

02. Demonstrated empirical measurement of harms in generative models.

03. Showed how harm quantification supports data-driven moderation.

Abstract

With significant advances in generative AI, new technologies are rapidly being deployed with generative components. Generative models are typically trained on large datasets, resulting in model behaviors that can mimic the worst of the content in the training data. Responsible deployment of generative technologies requires content moderation strategies, such as safety input and output filters. Here, we provide a theoretical framework for conceptualizing responsible content moderation of text-to-image generative technologies, including a demonstration of how to empirically measure the constructs we enumerate. We define and distinguish the concepts of safety, fairness, and metric equity, and enumerate example harms that can come in each domain. We then provide a demonstration of how the defined harms can be quantified. We conclude with a summary of how the style of harms quantification we…
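As a rough illustration of what quantifying harms could look like in practice, the following minimal sketch computes the share of flagged outputs per prompt group and the largest gap between groups. It is not the authors' method: the function names, the group labels, and the toy annotations are all hypothetical, and the choice of a simple rate gap as the summary statistic is an assumption made only for this example.

```python
from collections import defaultdict


def violation_rates(records):
    """Return the fraction of flagged outputs for each group.

    `records` is an iterable of (group, flagged) pairs, where `flagged`
    is True when a (hypothetical) annotator or classifier marked the
    generated image as harmful.
    """
    totals = defaultdict(int)
    flagged_counts = defaultdict(int)
    for group, flagged in records:
        totals[group] += 1
        if flagged:
            flagged_counts[group] += 1
    return {group: flagged_counts[group] / totals[group] for group in totals}


def max_rate_gap(rates):
    """Largest difference in flagged rate between any two groups."""
    if not rates:
        return 0.0
    return max(rates.values()) - min(rates.values())


# Hypothetical toy annotations, not data from the paper.
records = [
    ("group_a", True), ("group_a", False), ("group_a", False), ("group_a", False),
    ("group_b", True), ("group_b", True), ("group_b", False), ("group_b", False),
]

rates = violation_rates(records)
gap = max_rate_gap(rates)
print(rates, gap)
```

A per-group rate like this is only one possible summary; a real evaluation would also need a defined harm taxonomy, annotation guidelines, and enough samples per group for the rates to be meaningful.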

Taxonomy

Topics: Hate Speech and Cyberbullying Detection