Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
David Campbell, Neil Kale, Udari Madhushani Sehwag, Bert Herring, Nick Price, Dan Borges, Alex Levinson, Christina Q Knight

TL;DR
This paper shows that safety-aligned large language models often refuse legitimate cybersecurity assistance because they are biased against requests containing security-related language, which hampers defensive work.
Contribution
It identifies and quantifies Defensive Refusal Bias in safety-tuned LLMs, highlighting its impact on cybersecurity tasks and suggesting the need for improved mitigation strategies.
Findings
LLMs refuse security-sensitive requests at 2.72 times the rate of neutral requests
Refusal rates are highest in system hardening and malware analysis tasks
Explicit authorization increases refusal rates, indicating models interpret justifications as adversarial
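The first finding above compares two refusal rates as a ratio. The following is a minimal sketch of that arithmetic, not the authors' evaluation code: the function names are hypothetical and the counts in the example are invented purely for illustration; they are not data from the paper.

```python
def refusal_rate(refused: int, total: int) -> float:
    """Fraction of requests that were refused."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= refused <= total:
        raise ValueError("refused must be between 0 and total")
    return refused / total


def refusal_rate_ratio(
    sensitive_refused: int,
    sensitive_total: int,
    neutral_refused: int,
    neutral_total: int,
) -> float:
    """Ratio of the refusal rate for security-sensitive requests
    to the refusal rate for neutral requests."""
    neutral = refusal_rate(neutral_refused, neutral_total)
    if neutral == 0:
        raise ZeroDivisionError("neutral refusal rate is zero; ratio is undefined")
    return refusal_rate(sensitive_refused, sensitive_total) / neutral


if __name__ == "__main__":
    # Hypothetical counts (NOT from the paper): 30 of 200 sensitive
    # requests refused versus 10 of 200 neutral requests refused.
    ratio = refusal_rate_ratio(30, 200, 10, 200)
    print(f"sensitive requests are refused at {ratio:.2f}x the neutral rate")
```

A ratio above 1 means the security-sensitive phrasing is refused more often than the neutral phrasing of the same task; the paper reports a value of 2.72 for its own dataset.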
Abstract
Safety alignment in large language models (LLMs), particularly for cybersecurity tasks, primarily focuses on preventing misuse. While this approach reduces direct harm, it obscures a complementary failure mode: denial of assistance to legitimate defenders. We study Defensive Refusal Bias -- the tendency of safety-tuned frontier LLMs to refuse assistance for authorized defensive cybersecurity tasks when those tasks include similar language to an offensive cyber task. Based on 2,390 real-world examples from the National Collegiate Cyber Defense Competition (NCCDC), we find that LLMs refuse defensive requests containing security-sensitive keywords at 2.72 times the rate of semantically equivalent neutral requests. The highest refusal rates occur in the most operationally critical tasks: system hardening (43.8%) and malware analysis (34.3%). Interestingly, explicit…
Taxonomy
Topics: Information and Cyber Security · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques
