When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals

Riad Ahmed Anonto; Md Labid Al Nahiyan; Md Tanvir Hassan

arXiv:2512.01037 · cs.CL · December 22, 2025

TL;DR

This paper introduces a framework and metrics to measure local semantic inconsistency in language model refusals, revealing nuanced failure modes not captured by traditional global metrics.

Contribution

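It defines semantic confusion as a failure mode, builds ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters, and introduces three model-agnostic, token-level metrics (Confusion Index, Confusion Rate, and Confusion Depth) for diagnosing and tuning safety refusals.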

Findings

01. Metrics reveal unstable refusal boundaries in models.

02. Localized pockets of inconsistency exist despite overall safety.

03. Stricter refusal policies do not always increase inconsistency.

Abstract

Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model…
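
The nearest-accepted-neighbor comparison described in the abstract can be illustrated with a minimal sketch. This is not the paper's definition of Confusion Index, Confusion Rate, or Confusion Depth, which the truncated abstract does not spell out. It assumes that each prompt in a paraphrase cluster already has an embedding vector and a refuse/accept label, and it flags a refusal as locally inconsistent when its closest accepted paraphrase is above a cosine-similarity threshold. The function names and the 0.9 threshold are hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_accepted_similarity(
    refused_vec: Sequence[float], accepted_vecs: List[Sequence[float]]
) -> float:
    """Similarity between one refused prompt and its closest accepted prompt."""
    return max(cosine(refused_vec, a) for a in accepted_vecs)


def local_inconsistency_rate(
    embeddings: List[Sequence[float]],
    refused: List[bool],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> float:
    """Fraction of refusals whose nearest accepted paraphrase is very similar.

    Illustrative stand-in only; not the paper's Confusion Rate.
    """
    accepted = [e for e, r in zip(embeddings, refused) if not r]
    refusals = [e for e, r in zip(embeddings, refused) if r]
    if not accepted or not refusals:
        return 0.0  # nothing to compare within this cluster
    flagged = sum(
        1 for e in refusals
        if nearest_accepted_similarity(e, accepted) >= threshold
    )
    return flagged / len(refusals)
```

In this sketch, a cluster in which one refused prompt sits almost on top of an accepted paraphrase and another refused prompt is far from every accepted one would score 0.5. A global false-rejection rate computed over the same prompts would not distinguish between those two refusals.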

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Security and Verification in Computing