Measuring Safety Alignment Effects in Autonomous Security Agents
Isaac David, Arthur Gervais

TL;DR
This paper introduces a trace-based benchmark for evaluating safety alignment in autonomous security agents. It compares four stock language models against their uncensored or abliterated derivatives on security tasks and on non-security coding controls.
Contribution
It presents a system-level, trace-based evaluation method for safety alignment and argues that refusal rates alone are not enough: task success and grounding are measured as well.
Findings
The less-restricted Gemma derivatives succeed more often on security tasks than their stock counterparts: 14.0% versus 0.7% for 31B and 10.7% versus 0.0% for 26B.
Safety effects are not specific to security tasks; they also appear in coding tasks.
Hard proof-of-trigger and patch-verification tasks remain unsolved.
Abstract
Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus…
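As a rough illustration of how per-variant success rates like those quoted above could be aggregated from agent traces, here is a minimal Python sketch. It is not the authors' code: the `Trace` record, its field names, the variant labels, and the helpers `success_rates` and `paired_gap` are all hypothetical, and the only assumption taken from the abstract is that each trace carries a boolean result from a deterministic success predicate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One agent run. All field names are hypothetical, for illustration only."""
    model: str       # model family label, e.g. "model-a"
    variant: str     # "stock" or "less-restricted" (assumed labels)
    task_id: str     # identifier of the benchmark task
    succeeded: bool  # result of the task's deterministic success predicate


def success_rates(traces):
    """Return {(model, variant): fraction of traces that succeeded}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for t in traces:
        key = (t.model, t.variant)
        totals[key] += 1
        wins[key] += int(t.succeeded)
    return {key: wins[key] / totals[key] for key in totals}


def paired_gap(rates, model):
    """Success rate of the less-restricted variant minus the stock variant."""
    return rates[(model, "less-restricted")] - rates[(model, "stock")]
```

With such a table, a paired comparison is a single subtraction per model family. Grounding scores and the non-security control traces could be aggregated the same way under their own keys.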
