Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification
Kyuri Im, Shuzhou Yuan, and Michael Färber

TL;DR
This paper investigates the false refusal behavior of large language models in hate speech detoxification, revealing biases towards certain groups and proposing a translation-based mitigation strategy to reduce refusals.
Contribution
It systematically analyzes biases causing false refusals in LLMs and introduces a simple cross-translation method to mitigate this issue.
Findings
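LLMs disproportionately refuse inputs with higher semantic toxicity and inputs targeting specific groups, particularly nationality, religion, and political ideology.
Multilingual datasets show lower overall false refusal rates than English datasets, but models still exhibit systematic, language-dependent biases toward certain targets.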
Cross-translation significantly reduces false refusals while maintaining content integrity.
Abstract
While large language models (LLMs) have increasingly been applied to hate speech detoxification, the prompts often trigger safety alerts, causing LLMs to refuse the task. In this study, we systematically investigate false refusal behavior in hate speech detoxification and analyze the contextual and linguistic biases that trigger such refusals. We evaluate nine LLMs on both English and multilingual datasets. Our results show that LLMs disproportionately refuse inputs with higher semantic toxicity and those targeting specific groups, particularly nationality, religion, and political ideology. Although multilingual datasets exhibit lower overall false refusal rates than English datasets, models still display systematic, language-dependent biases toward certain targets. Based on these findings, we propose a simple cross-translation strategy, translating English hate speech into Chinese for…
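The per-target comparison described in the abstract can be illustrated with a small sketch that computes a false refusal rate for each target group. This is not the authors' evaluation code: the keyword-based `is_refusal` check, the `REFUSAL_MARKERS` list, and the toy `records` are all assumptions made for illustration, and the paper's actual refusal detection method is not stated in this excerpt.

```python
from collections import defaultdict

# Hypothetical refusal markers (assumption); a real study would likely
# use a more careful classifier or manual annotation.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any assumed refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def false_refusal_rate_by_group(records):
    """Compute the share of refused detoxification requests per target group.

    `records` is an iterable of (target_group, model_response) pairs.
    Every request is assumed to be a legitimate detoxification task,
    so any refusal counts as a false refusal.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for group, response in records:
        totals[group] += 1
        if is_refusal(response):
            refused[group] += 1
    return {group: refused[group] / totals[group] for group in totals}


# Toy data, invented purely to show the calculation.
records = [
    ("religion", "I'm sorry, but I can't help with that."),
    ("religion", "Here is a neutral rewording: ..."),
    ("gender", "Here is a neutral rewording: ..."),
    ("gender", "A less offensive version: ..."),
]
rates = false_refusal_rate_by_group(records)
```

Comparing the resulting rates across groups (and across input languages) is one simple way to surface the kind of target-dependent disparity the abstract reports.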
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Sentiment Analysis and Opinion Mining
