When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals
Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan

TL;DR
This paper introduces a framework and metrics for measuring local semantic inconsistency in language model refusals, where a model accepts one phrasing of an intent but rejects a close paraphrase. Global metrics such as false rejection rate do not capture this failure mode.
Contribution
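It proposes the concept of semantic confusion, builds a new dataset, ParaGuard (a 10k-prompt corpus of controlled paraphrase clusters), and introduces three model-agnostic, token-level metrics (Confusion Index, Confusion Rate, and Confusion Depth) to diagnose and tune model safety refusals.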
Findings
Metrics reveal unstable refusal boundaries in models.
Localized pockets of inconsistency exist despite overall safety.
Stricter refusal policies do not always increase inconsistency.
Abstract
Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model…
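The idea of comparing each refusal to its nearest accepted neighbors can be illustrated with a toy sketch. This is not the paper's definition of Confusion Index, Confusion Rate, or Confusion Depth, which the abstract describes as token-level and based on token embeddings, next-token probabilities, and perplexity signals. The sketch below assumes only one fixed embedding vector per prompt; the function name `local_inconsistency_rate`, the cosine-similarity measure, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def local_inconsistency_rate(embeddings, refused, threshold=0.9):
    """Fraction of refused prompts whose nearest accepted prompt is at
    least `threshold` similar (hypothetical stand-in, not the paper's metric).

    embeddings: list of prompt vectors from one paraphrase cluster
    refused:    list of booleans, True where the model refused
    """
    accepted = [e for e, r in zip(embeddings, refused) if not r]
    rejected = [e for e, r in zip(embeddings, refused) if r]
    if not accepted or not rejected:
        # No accept/refuse contrast inside this cluster.
        return 0.0
    confused = 0
    for e in rejected:
        nearest = max(cosine(e, a) for a in accepted)
        if nearest >= threshold:
            confused += 1
    return confused / len(rejected)


# Toy cluster: the second prompt is a near-paraphrase of the first but was
# refused; the third is refused and far from anything accepted.
cluster = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
refused = [False, True, True]
rate = local_inconsistency_rate(cluster, refused)
print(rate)
```

A global refusal rate for this toy cluster would report two refusals out of three and nothing more. The neighbor-based view separates the refusal that sits right next to an accepted paraphrase from the one that does not.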
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Security and Verification in Computing
