Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models

Arka Dutta; Adel Khorramrouz; Sujan Dutta; Ashiqur R. KhudaBukhsh

arXiv:2309.06415·cs.CL·April 2, 2024·1 cites

TL;DR

This paper introduces the 'toxicity rabbit hole', a framework that iteratively elicits toxic content from large language models in order to audit their guardrails for bias. The authors apply it to PaLM 2 across 1,266 identity groups, report that it generalizes to several other models, and discuss the potential ramifications.

Contribution

It proposes a novel, generalizable framework for eliciting toxic content from large language models and applies it to analyze biases across multiple models and identity groups.

Findings

01. Bias audit of PaLM 2 reveals significant toxic content.

02. Framework generalizes across several language models.

03. Highlights risks related to racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia.

Abstract

This paper makes three contributions. First, it presents a generalizable, novel framework dubbed 'toxicity rabbit hole' that iteratively elicits toxic content from a wide suite of large language models. Spanning a set of 1,266 identity groups, we first conduct a bias audit of PaLM 2 guardrails, presenting key insights. Next, we report generalizability across several other models. Through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia. Finally, driven by concrete examples, we discuss potential ramifications.

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: Pathways Language Model