StepShield: When, Not Whether to Intervene on Rogue Agents

Gloria Felicia (University of Virginia); Michael Eniolade (University of the Cumberlands); Jinfeng He (Cornell University); Zitha Sasindran (Indian Institute of Science Bangalore); Hemant Kumar (University of Arizona); Milan Hussain Angati (California State University Northridge); Sandeep Bandarupalli (University of Cincinnati)

arXiv:2601.22136 · cs.LG · January 30, 2026

Open Access

TL;DR

StepShield is a benchmark that evaluates when a detector flags a violation in an AI agent's trajectory, not just whether it does. It rewards early intervention over binary accuracy, which matters for both safety and monitoring cost.

Contribution

The paper presents the first benchmark to measure when violations are detected, introduces three temporal metrics (Early Intervention Rate, Intervention Gap, and Tokens Saved), and demonstrates the economic benefits of early detection.

Findings

01. LLM-based judge achieves 59% early intervention rate

02. Static analyzer achieves 26% early intervention rate

03. Early detection reduces monitoring costs by 75%

Abstract

Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only…
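
The abstract names the three temporal metrics but this excerpt does not define them. The sketch below shows one plausible way to compute them, under assumptions that are not taken from the paper: EIR is treated as the fraction of rogue trajectories flagged within `tolerance` steps of the first violating step, Intervention Gap as the number of steps between the first violation and the flag, and Tokens Saved as the tokens of all steps after the flag. The `Trajectory` class, its fields, and the `tolerance` parameter are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Trajectory:
    """Hypothetical record for one agent run (not the paper's data format)."""
    step_tokens: Sequence[int]      # tokens consumed at steps 1..N
    violation_step: Optional[int]   # first rogue step (1-indexed), None if benign
    detection_step: Optional[int]   # step at which the detector flagged, None if never


def early_intervention_rate(trajs: Sequence[Trajectory], tolerance: int = 0) -> float:
    """Assumed EIR: share of rogue runs flagged by violation_step + tolerance."""
    rogue = [t for t in trajs if t.violation_step is not None]
    if not rogue:
        return 0.0
    early = sum(
        1 for t in rogue
        if t.detection_step is not None
        and t.detection_step <= t.violation_step + tolerance
    )
    return early / len(rogue)


def mean_intervention_gap(trajs: Sequence[Trajectory]) -> Optional[float]:
    """Assumed gap: mean of (detection_step - violation_step) over detected rogue runs."""
    gaps = [
        t.detection_step - t.violation_step
        for t in trajs
        if t.violation_step is not None and t.detection_step is not None
    ]
    return sum(gaps) / len(gaps) if gaps else None


def tokens_saved(t: Trajectory) -> int:
    """Assumed saving: tokens of every step after the flagged step."""
    if t.detection_step is None:
        return 0
    # Steps are 1-indexed, so slicing at detection_step skips steps 1..detection_step.
    return sum(t.step_tokens[t.detection_step:])
```

Under these assumed definitions, a detector that flags at step 8 of a 48-step rogue run saves the tokens of the remaining 40 steps, while one that flags at step 48 saves none, which is the distinction the abstract draws between intervention and forensics.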

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Ethics and Social Impacts of AI