Probing Association Biases in LLM Moderation Over-Sensitivity

Yuxin Wang; Botao Yu; Ivory Yang; Saeed Hassanpour; Soroush Vosoughi

arXiv:2505.23914 · cs.CL · March 19, 2026

Open Access

TL;DR

This paper investigates over-sensitivity in large language models used for content moderation. It shows that systematic topic-toxicity associations contribute to false positives beyond explicit offensive triggers, and it proposes Topic Association Analysis to characterize this behavior.

Contribution

The paper introduces Topic Association Analysis, a novel probe to quantify topic-toxicity associations in LLMs, highlighting their role in over-sensitivity and false positives in moderation tasks.
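As a rough illustration of what "quantifying topic amplification" could look like, the sketch below compares the topic distribution of an original comment with that of a scenario written around it. Everything here is an assumption made for illustration: the keyword lexicon `TOPIC_KEYWORDS`, the add-one smoothing, the ratio-based score, and the hard-coded scenario string (which in the paper's setting would be elicited from the LLM under study). It is not the paper's implementation.

```python
from collections import Counter

# Hypothetical keyword lexicon; a real study would likely use a trained topic model.
TOPIC_KEYWORDS = {
    "politics": {"election", "vote", "senator", "policy"},
    "religion": {"church", "faith", "prayer", "mosque"},
    "health": {"doctor", "vaccine", "hospital", "illness"},
}


def topic_distribution(text):
    """Return an add-one-smoothed distribution over topics from keyword counts."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    counts = Counter()
    for topic, words in TOPIC_KEYWORDS.items():
        counts[topic] = sum(1 for t in tokens if t in words)
    total = sum(counts.values()) + len(TOPIC_KEYWORDS)
    return {topic: (counts[topic] + 1) / total for topic in TOPIC_KEYWORDS}


def topic_amplification(comment, scenario):
    """Per-topic ratio of scenario share to comment share (above 1 means amplified)."""
    p_comment = topic_distribution(comment)
    p_scenario = topic_distribution(scenario)
    return {t: p_scenario[t] / p_comment[t] for t in TOPIC_KEYWORDS}


# Assumed example: the scenario stands in for text an LLM would generate.
comment = "I will vote tomorrow."
scenario = "After the election, the senator argued about policy and the vote."
amplification = topic_amplification(comment, scenario)
```

A ratio above 1 for a topic means the scenario leans on that topic more heavily than the comment did. Aggregating such scores over many false positives is one plausible way to expose a skew toward particular topics.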

Findings

01. Advanced models show stronger topic-association skew in false positives.

02. Topic cues can influence false-positive rates through prefix interventions.

03. Mitigating over-sensitivity may require addressing learned topic associations.
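The prefix-intervention finding can be pictured as a simple before-and-after comparison of false-positive rates on benign inputs. The sketch below is a toy under stated assumptions: `toy_classifier` is a keyword stub standing in for an LLM moderator, and the prefix wording and example sentences are invented for illustration. None of them come from the paper.

```python
def false_positive_rate(classify, benign_texts, prefix=""):
    """Fraction of benign texts that the classifier flags as toxic."""
    flagged = sum(1 for text in benign_texts if classify(prefix + text))
    return flagged / len(benign_texts)


def toy_classifier(text):
    """Stand-in moderator (assumption): flags any text mentioning a sensitive topic word."""
    lowered = text.lower()
    return any(word in lowered for word in ("religion", "immigration"))


# Invented benign examples; every flag on these counts as a false positive.
benign = [
    "I enjoyed the parade today.",
    "My neighbor bakes great bread.",
    "Immigration offices close at five.",
]

baseline = false_positive_rate(toy_classifier, benign)
with_prefix = false_positive_rate(toy_classifier, benign, prefix="Topic: religion. ")
```

With this stub, the topic prefix raises the false-positive rate because the classifier reacts to the topic word rather than to the content. That mirrors, in a deliberately exaggerated way, the kind of effect a prefix intervention is designed to measure.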

Abstract

Large Language Models are widely used for content moderation but often exhibit over-sensitivity, leading them to misclassify benign content and reject safe user commands. While previous research attributes this issue primarily to the presence of explicit offensive triggers, we statistically reveal a deeper connection beyond the token level: when behaving over-sensitively, particularly on decontextualized statements, LLMs exhibit systematic topic-toxicity association patterns that go beyond explicit offensive triggers. To characterize these patterns, we propose Topic Association Analysis, a behavior-based probe that elicits short contextual scenarios for benign inputs and quantifies topic amplification between the scenario and the original comment. Across multiple LLMs and large-scale data, we find that more advanced models (e.g., GPT-4 Turbo) show stronger topic-association…

Taxonomy

Topics: Natural Language Processing Techniques · Nuclear Materials and Properties · Topic Modeling

Methods: Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding · Attention Is All You Need