A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models

Noa Linder; Meirav Segal; Omer Antverg; Gil Gekker; Tomer Fichman; Omri Bodenheimer; Edan Maor; Omer Nevo

arXiv:2602.15689 · cs.CL · February 19, 2026

Open Access

TL;DR

This paper proposes a content-based framework for cybersecurity refusal decisions in large language models, explicitly modeling offense-defense tradeoffs to improve consistency and tunability of refusal policies.

Contribution

The paper introduces a content-grounded approach that characterizes each request along five dimensions, so that offensive risk can be weighed against defensive benefit.

Findings

01 Resolves inconsistencies in current refusal policies.

02 Enables construction of tunable, risk-aware refusal strategies.

03 Grounds refusal decisions in technical request content.

Abstract

Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and prove brittle under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical…
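
The idea of a tunable refusal policy that weighs offensive risk against defensive benefit can be illustrated with a minimal sketch. This is not the paper's method: the abstract names five dimensions, but the listing here is truncated, so the sketch uses only two scalar quantities taken from the abstract's wording. The class `RequestScores`, the function `should_refuse`, the parameters `benefit_weight` and `threshold`, and the [0, 1] score range are all hypothetical names and assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestScores:
    """Hypothetical per-request scores, each assumed to lie in [0, 1]."""
    offensive_risk: float      # assumed: how much the response could aid an attack
    defensive_benefit: float   # assumed: how much the response could aid defenders


def should_refuse(scores: RequestScores,
                  benefit_weight: float = 1.0,
                  threshold: float = 0.0) -> bool:
    """Refuse when risk exceeds the weighted benefit by more than `threshold`.

    `benefit_weight` and `threshold` are the tuning knobs of this toy policy:
    raising either one makes the policy more permissive.
    """
    net_risk = scores.offensive_risk - benefit_weight * scores.defensive_benefit
    return net_risk > threshold


if __name__ == "__main__":
    high_risk = RequestScores(offensive_risk=0.8, defensive_benefit=0.3)
    defender_heavy = RequestScores(offensive_risk=0.4, defensive_benefit=0.7)

    print(should_refuse(high_risk))                           # risk dominates
    print(should_refuse(defender_heavy))                      # benefit dominates
    print(should_refuse(defender_heavy, benefit_weight=0.0))  # benefit ignored
```

Setting `benefit_weight` to zero reduces this toy policy to a purely offense-focused classifier, the kind of approach the abstract argues against. Keeping both terms makes the tradeoff explicit and lets an operator adjust how restrictive the policy is.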

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Information and Cyber Security · Adversarial Robustness in Machine Learning