Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
Arka Dutta, Adel Khorramrouz, Sujan Dutta, Ashiqur R. KhudaBukhsh

TL;DR
This paper introduces a new framework called 'toxicity rabbit hole' for bias auditing of large language models, revealing insights into toxic content related to various identity groups and discussing potential societal impacts.
Contribution
It proposes a novel, generalizable framework for eliciting toxic content from large language models and applies it to analyze biases across multiple models and identity groups.
Findings
A bias audit of PaLM 2 guardrails, spanning 1,266 identity groups, elicits significant toxic content.
The framework generalizes across several other language models.
The elicited content highlights risks related to racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia.
Abstract
This paper makes three contributions. First, it presents a generalizable, novel framework dubbed 'toxicity rabbit hole' that iteratively elicits toxic content from a wide suite of large language models. Spanning a set of 1,266 identity groups, we first conduct a bias audit of PaLM 2 guardrails presenting key insights. Next, we report generalizability across several other models. Through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia. Finally, driven by concrete examples, we discuss potential ramifications.
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
Methods: Pathways Language Model
