Policy-Invisible Violations in LLM-Based Agents
Jie Wu, Ming Gong

TL;DR
This paper introduces PhantomPolicy, a benchmark for policy-invisible violations in LLM-based agents, and Sentinel, a graph-based enforcement framework that improves policy compliance by grounding in world state.
Contribution
It presents a new benchmark for policy-invisible violations and a novel enforcement framework that leverages world state simulation to enhance policy adherence.
Findings
Sentinel reaches 93.0% accuracy, outperforming content-only baselines.
Manual review changed 32 of 600 trace labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review.
The benchmark covers eight violation categories with balanced violation and safe-control cases.
Abstract
LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent's visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To…
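To make the failure mode concrete, the following is a minimal sketch of a policy-invisible violation. It is not the authors' Sentinel implementation: the `WorldState` store, the `share_file` tool, the attribute names, and the single "internal-only files must not go to external recipients" rule are all hypothetical, chosen only for illustration. It contrasts a check that sees only the action's visible arguments with one that looks up entity attributes in a world-state store.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    # Hypothetical store of facts that never appear in tool responses.
    entity_attrs: dict = field(default_factory=dict)


@dataclass
class Action:
    tool: str
    args: dict


def content_only_check(action: Action) -> bool:
    """Judge the action from its visible arguments alone (True = allowed)."""
    # Nothing in the arguments marks the file as restricted, so this passes.
    return "confidential" not in str(action.args).lower()


def state_grounded_check(action: Action, world: WorldState) -> bool:
    """Judge the action after looking up hidden entity attributes."""
    if action.tool != "share_file":
        return True
    file_attrs = world.entity_attrs.get(action.args["file_id"], {})
    rcpt_attrs = world.entity_attrs.get(action.args["recipient"], {})
    restricted = file_attrs.get("classification") == "internal_only"
    external = rcpt_attrs.get("is_external", False)
    # Assumed illustrative rule: internal-only files must stay internal.
    return not (restricted and external)


world = WorldState(entity_attrs={
    "doc-17": {"classification": "internal_only"},
    "alice@partner.example": {"is_external": True},
    "bob@corp.example": {"is_external": False},
})

violation = Action("share_file",
                   {"file_id": "doc-17", "recipient": "alice@partner.example"})
safe_control = Action("share_file",
                      {"file_id": "doc-17", "recipient": "bob@corp.example"})

print(content_only_check(violation), state_grounded_check(violation, world))
print(content_only_check(safe_control), state_grounded_check(safe_control, world))
```

The two actions look alike from their arguments, so the content-only check allows both. Only the lookup against the hidden attributes separates the violation from its safe control, which mirrors the benchmark's pairing of violation and safe-control cases.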
