A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
Noa Linder, Meirav Segal, Omer Antverg, Gil Gekker, Tomer Fichman, Omri Bodenheimer, Edan Maor, Omer Nevo

TL;DR
This paper proposes a content-based framework for cybersecurity refusal decisions in large language models, explicitly modeling offense-defense tradeoffs to improve consistency and tunability of refusal policies.
Contribution
The paper introduces a content-grounded approach that characterizes each request along five dimensions, so that offensive risk can be weighed explicitly against defensive benefit.
Findings
Resolves inconsistencies in current refusal policies.
Enables construction of tunable, risk-aware refusal strategies.
Grounds refusal decisions in technical request content.
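To make the idea of a tunable, risk-aware refusal strategy concrete, here is a minimal sketch. It is not the paper's method. The two scores, the `RequestScores` type, the `should_refuse` function, and the `risk_weight` and `threshold` parameters are all hypothetical names chosen for illustration, and the sketch deliberately does not try to reproduce the paper's five dimensions. It only shows how a single weight could shift a policy between permissive and restrictive while the decision rule stays tied to scores derived from request content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestScores:
    """Hypothetical content-derived scores in [0, 1] (assumed, not from the paper)."""
    offensive_risk: float      # how much the response could help an attacker
    defensive_benefit: float   # how much the response could help a defender


def should_refuse(scores: RequestScores,
                  risk_weight: float = 1.0,
                  threshold: float = 0.0) -> bool:
    """Refuse when weighted offensive risk exceeds defensive benefit by more than threshold.

    risk_weight is the tuning knob: larger values give a more restrictive policy.
    """
    net_risk = risk_weight * scores.offensive_risk - scores.defensive_benefit
    return net_risk > threshold


# Illustrative scores only; real scores would come from analysing the request content.
patch_analysis = RequestScores(offensive_risk=0.2, defensive_benefit=0.9)
exploit_request = RequestScores(offensive_risk=0.9, defensive_benefit=0.1)
borderline = RequestScores(offensive_risk=0.5, defensive_benefit=0.6)
```

Under these assumed scores, the same borderline request is allowed with the default weight and refused once `risk_weight` is raised to 2.0, which is the sense in which such a policy would be tunable.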
Abstract
Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and behave brittlely under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Information and Cyber Security · Adversarial Robustness in Machine Learning
