Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders

David Campbell; Neil Kale; Udari Madhushani Sehwag; Bert Herring; Nick Price; Dan Borges; Alex Levinson; Christina Q Knight

arXiv:2603.01246 · cs.CR · March 12, 2026

Open Access

TL;DR

This paper shows that safety-aligned large language models often refuse legitimate defensive cybersecurity requests because those requests contain security-related language, which hampers defenders.

Contribution

The paper identifies and quantifies Defensive Refusal Bias in safety-tuned LLMs, shows its impact on defensive cybersecurity tasks, and points to the need for better mitigation strategies.

Findings

01

LLMs refuse defensive requests containing security-sensitive keywords at 2.72 times the rate of semantically equivalent neutral requests (p < 0.001)

02

Refusal rates are highest in the most operationally critical tasks: system hardening (43.8%) and malware analysis (34.3%)

03

Explicit authorization increases refusal rates, suggesting that models interpret justifications as adversarial

Abstract

Safety alignment in large language models (LLMs), particularly for cybersecurity tasks, primarily focuses on preventing misuse. While this approach reduces direct harm, it obscures a complementary failure mode: denial of assistance to legitimate defenders. We study Defensive Refusal Bias: the tendency of safety-tuned frontier LLMs to refuse assistance for authorized defensive cybersecurity tasks when those tasks use language similar to that of offensive cyber tasks. Based on 2,390 real-world examples from the National Collegiate Cyber Defense Competition (NCCDC), we find that LLMs refuse defensive requests containing security-sensitive keywords at $2.72\times$ the rate of semantically equivalent neutral requests ($p < 0.001$). The highest refusal rates occur in the most operationally critical tasks: system hardening (43.8%) and malware analysis (34.3%). Interestingly, explicit…
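To make the headline statistic concrete, the sketch below shows one way a refusal-rate ratio and its significance could be computed from raw counts. It is an illustration only: the function names are hypothetical, the two-proportion z-test is an assumed choice (the listing does not say which test the authors used), and the counts in the usage lines are invented for demonstration and are not data from the paper.

```python
import math


def refusal_rate(refused: int, total: int) -> float:
    """Fraction of requests that were refused."""
    if total <= 0:
        raise ValueError("total must be positive")
    return refused / total


def rate_ratio(refused_a: int, total_a: int, refused_b: int, total_b: int) -> float:
    """Refusal rate of group A divided by the refusal rate of group B."""
    rate_b = refusal_rate(refused_b, total_b)
    if rate_b == 0:
        raise ValueError("group B has no refusals; ratio is undefined")
    return refusal_rate(refused_a, total_a) / rate_b


def two_proportion_z_test(refused_a: int, total_a: int,
                          refused_b: int, total_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumption: this is a generic textbook test, not necessarily
    the test used in the paper.
    """
    p_a = refusal_rate(refused_a, total_a)
    p_b = refusal_rate(refused_b, total_b)
    pooled = (refused_a + refused_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("standard error is zero; test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, NOT from the paper: 136 of 1000 keyword-bearing
# requests refused versus 50 of 1000 neutral rephrasings.
ratio = rate_ratio(136, 1000, 50, 1000)
z, p = two_proportion_z_test(136, 1000, 50, 1000)
print(f"ratio={ratio:.2f}, z={z:.2f}, p={p:.2e}")
```

A ratio above 1 means the keyword-bearing group is refused more often. The z-test treats the two groups as independent samples; if each sensitive request is paired with its own neutral rephrasing, a paired test such as McNemar's may be the more appropriate choice.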

Taxonomy

Topics: Information and Cyber Security · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques