Measuring Safety Alignment Effects in Autonomous Security Agents

Isaac David; Arthur Gervais

arXiv:2605.19722·cs.CR·May 20, 2026

TL;DR

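This paper introduces a trace-based benchmark of 30 local vulnerability-analysis tasks for evaluating safety alignment in autonomous security agents. It compares stock safety-aligned language models with their uncensored or abliterated derivatives on security tasks and on non-security control tasks.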

Contribution

The paper presents a system-level, trace-based evaluation method for safety alignment in security agents, emphasizing that metrics beyond refusal rates, such as task success and grounding, are needed.

Findings

01 The less-restricted Gemma derivatives succeed more often on security tasks than their stock counterparts: 14.0% versus 0.7% for 31B and 10.7% versus 0.0% for 26B.

02 Safety effects are not specific to security tasks; they also appear in coding tasks.

03 Hard proof-of-trigger and patch-verification tasks remain unsolved.

Abstract

Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus…
