Probing Association Biases in LLM Moderation Over-Sensitivity
Yuxin Wang, Botao Yu, Ivory Yang, Saeed Hassanpour, Soroush Vosoughi

TL;DR
This paper investigates over-sensitivity in large language models used for content moderation. It shows that systematic topic-toxicity associations, not just explicit offensive triggers, contribute to false positives, and it proposes Topic Association Analysis to characterize this behavior.
Contribution
The paper introduces Topic Association Analysis, a novel probe to quantify topic-toxicity associations in LLMs, highlighting their role in over-sensitivity and false positives in moderation tasks.
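As a rough illustration of what "quantifying topic amplification" could look like, the sketch below compares a topic distribution for the original comment with one for the model-elicited scenario. The summary does not give the paper's actual metric, so the difference-of-weights definition, the function names, and the dictionary input format are all assumptions made for illustration; the topic distributions would come from whatever topic classifier is available.

```python
from collections import defaultdict

def topic_amplification(comment_topics, scenario_topics):
    """Per-topic shift: scenario weight minus original-comment weight.

    Both arguments map topic name -> weight (e.g. a probability).
    ASSUMPTION: a simple difference of weights, not the paper's metric.
    """
    topics = set(comment_topics) | set(scenario_topics)
    return {
        t: scenario_topics.get(t, 0.0) - comment_topics.get(t, 0.0)
        for t in topics
    }

def mean_amplification(pairs):
    """Average the per-topic shift over (comment_topics, scenario_topics) pairs."""
    totals = defaultdict(float)
    for comment_topics, scenario_topics in pairs:
        for topic, delta in topic_amplification(comment_topics, scenario_topics).items():
            totals[topic] += delta
    n = len(pairs)
    return {t: v / n for t, v in totals.items()} if n else {}

# Toy, made-up distributions for two benign comments and their scenarios.
example_pairs = [
    ({"politics": 0.25, "sports": 0.75}, {"politics": 0.75, "sports": 0.25}),
    ({"politics": 0.0, "sports": 1.0}, {"politics": 0.5, "sports": 0.5}),
]
example_result = mean_amplification(example_pairs)
```

A positive value for a topic means the elicited scenarios lean toward that topic more than the original comments did, which is the kind of skew the probe is meant to surface.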
Findings
More advanced models (e.g., GPT-4 Turbo) show stronger topic-association skew in their false positives.
Topic cues can influence false-positive rates through prefix interventions.
Mitigating over-sensitivity may require addressing learned topic associations.
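The prefix-intervention finding can be pictured as a before-and-after comparison of false-positive rates on benign inputs. The sketch below is a minimal version under stated assumptions: `classify` is a hypothetical callable returning `True` when a text is flagged as toxic, the prefix is simply prepended to each input, and the toy keyword classifier is a stand-in for a real moderation model, not anything from the paper.

```python
from typing import Callable, Iterable

def false_positive_rate(
    classify: Callable[[str], bool],
    benign_texts: Iterable[str],
    prefix: str = "",
) -> float:
    """Fraction of benign texts flagged as toxic, with an optional topic prefix."""
    texts = list(benign_texts)
    if not texts:
        return 0.0
    flagged = sum(1 for t in texts if classify(prefix + t))
    return flagged / len(texts)

def prefix_effect(
    classify: Callable[[str], bool],
    benign_texts: Iterable[str],
    prefix: str,
) -> float:
    """Change in false-positive rate caused by prepending the prefix."""
    texts = list(benign_texts)
    return false_positive_rate(classify, texts, prefix) - false_positive_rate(classify, texts)

# Hypothetical stand-in classifier: flags any text that mentions "religion".
def toy_classify(text: str) -> bool:
    return "religion" in text.lower()

benign = [
    "I went to the market.",
    "Religion class was long.",
    "Nice weather today.",
    "We played chess.",
]
baseline = false_positive_rate(toy_classify, benign)
shift = prefix_effect(toy_classify, benign, "Topic: religion. ")
```

Since every input here is benign by construction, any flag counts as a false positive, and a nonzero `shift` indicates that the topic cue alone moved the classifier's decisions.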
Abstract
Large Language Models are widely used for content moderation but often exhibit over-sensitivity, leading to misclassification of benign content and rejection of safe user commands. While previous research attributes this issue primarily to the presence of explicit offensive triggers, we statistically reveal a deeper connection beyond the token level: when behaving over-sensitively, particularly on decontextualized statements, LLMs exhibit systematic topic-toxicity association patterns that go beyond explicit offensive triggers. To characterize these patterns, we propose Topic Association Analysis, a behavior-based probe that elicits short contextual scenarios for benign inputs and quantifies topic amplification between the scenario and the original comment. Across multiple LLMs and large-scale data, we find that more advanced models (e.g., GPT-4 Turbo) show stronger topic-association…
Taxonomy
Topics: Natural Language Processing Techniques · Nuclear Materials and Properties · Topic Modeling
Methods: Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding · Attention Is All You Need
