Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models
Shuzhou Yuan, Ercong Nie, Mario Tawfelis, Helmut Schmid, Hinrich Schütze, and Michael Färber

TL;DR
This study explores how MBTI-based persona prompts influence hate speech detection by large language models, revealing significant persona-driven biases and inconsistencies that impact fairness and annotation reliability.
Contribution
It is the first comprehensive investigation into the effect of persona prompts on LLM hate speech classification, highlighting biases and variability introduced by different personas.
Findings
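In a human annotation survey, MBTI dimensions significantly affect labeling behavior.
Persona prompts cause substantial variation in LLM outputs, including inconsistencies with ground truth and disagreement between personas.
Persona-driven biases are also visible at the logit level.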
Abstract
Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully…
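As a rough illustration of the setup described in the abstract, the sketch below builds one prompt per MBTI persona and measures how often personas disagree on the same items. It is a minimal sketch under stated assumptions, not the authors' implementation: the prompt wording, the function names `build_persona_prompt` and `pairwise_disagreement`, and the disagreement measure are all hypothetical, and no model is actually called.

```python
from itertools import combinations, product

# The 16 MBTI types, built from the four dichotomies (E/I, S/N, T/F, J/P).
MBTI_TYPES = ["".join(p) for p in product("EI", "SN", "TF", "JP")]


def build_persona_prompt(mbti: str, text: str) -> str:
    """Hypothetical prompt template; the paper's exact wording is not shown here."""
    return (
        f"You are a person with the MBTI personality type {mbti}. "
        "Is the following text hate speech? Answer 'yes' or 'no'.\n"
        f"Text: {text}"
    )


def pairwise_disagreement(labels_by_persona: dict[str, list[int]]) -> float:
    """Fraction of (persona pair, item) comparisons in which the two labels differ.

    This is an assumed, simple disagreement measure chosen for illustration.
    """
    differing = 0
    total = 0
    for a, b in combinations(sorted(labels_by_persona), 2):
        labels_a, labels_b = labels_by_persona[a], labels_by_persona[b]
        if len(labels_a) != len(labels_b):
            raise ValueError("every persona must label the same items")
        differing += sum(x != y for x, y in zip(labels_a, labels_b))
        total += len(labels_a)
    return differing / total if total else 0.0
```

In practice, each prompt from `build_persona_prompt` would be sent to a model and the parsed yes/no answers collected into `labels_by_persona` (1 for hate speech, 0 otherwise) before computing the disagreement rate.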
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Emotion and Mood Recognition
