Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan,, Karthik Narasimhan

TL;DR
This study systematically evaluates toxicity in over half a million ChatGPT outputs, revealing that assigning personas can significantly increase harmful and biased responses, highlighting safety concerns in dialogue-based language models.
Contribution
It provides the first large-scale analysis of how persona assignment affects toxicity and bias in ChatGPT, exposing safety vulnerabilities and biases in current LLMs.
Findings
Persona assignment can increase toxicity up to 6x.
Certain entities are targeted 3x more regardless of persona.
Model exhibits inherent discriminatory biases.
Abstract
Large language models (LLMs) have shown incredible capabilities and transcended the natural language processing (NLP) community, with adoption throughout many services like healthcare, therapy, education, and customer service. Since users include people with critical information needs like students or patients engaging with chatbots, the safety of these systems is of prime importance. Therefore, a clear understanding of the capabilities and limitations of LLMs is necessary. To this end, we systematically evaluate toxicity in over half a million generations of ChatGPT, a popular dialogue-based LLM. We find that setting the system parameter of ChatGPT by assigning it a persona, say that of the boxer Muhammad Ali, significantly increases the toxicity of generations. Depending on the persona assigned to ChatGPT, its toxicity can increase up to 6x, with outputs engaging in incorrect…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Hate Speech and Cyberbullying Detection
Methodstravel james
