Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Zhiyu Xue; Zimo Qi; Guangliang Liu; Bocheng Chen; Ramtin Pedarsani

arXiv:2603.11388·cs.AI·March 13, 2026

Open Access

TL;DR

This paper investigates overrefusal in safety-aligned large language models, traces it to linguistic cues in the safety training data that become associated with refusal responses, and proposes a mitigation method that improves the balance between safety and usability.

Contribution

The paper provides a mechanistic understanding of how overrefusal arises and introduces a fine-tuning approach that explicitly accounts for refusal triggers in order to reduce it.

Findings

01. Improved balance between safety and responsiveness in LLMs

02. Outperforms prior methods in mitigating overrefusal

03. Enhances robustness against jailbreak attacks

Abstract

Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, in which aligned LLMs also reject benign queries after safety alignment post-training, remains insufficiently studied. This issue degrades the usability of safety alignment in real-world applications. In this paper, we examine how overrefusal arises under safety alignment and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses. Safety alignment encourages LLMs to associate refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, the refusal triggers include not only harmful linguistic cues…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)