Beyond Single-Agent Alignment: Preventing Context-Fragmented Violations in Multi-Agent Systems
Jie Wu, Ming Gong

TL;DR
This paper introduces Distributed Sentinel, a distributed enforcement architecture built on Semantic Taint Tokens that prevents cross-agent policy violations in multi-agent systems. On the new PhantomEcosystem benchmark it reaches 0.95 F1, compared with 0.85 for prompt-based filtering and 0.65 for rule-based DLP.
Contribution
The paper proposes a distributed enforcement system based on Semantic Taint Tokens, together with a benchmark for cross-agent policy violations in multi-agent systems.
Findings
Distributed Sentinel achieves 0.95 F1 score on the PhantomEcosystem benchmark.
Existing prompt-based filtering achieves 0.85 F1; rule-based data loss prevention (DLP) achieves 0.65 F1.
All evaluated LLMs show high violation rates (14-98%) in multi-agent workflows.
Abstract
We identify and formalize a novel security risk: Context-Fragmented Violations (CFVs), a class of policy breaches in which individual agent actions appear locally safe and reasonable, yet collectively violate organizational policies because critical policy facts are siloed in different departments' private contexts. Existing prompt-based alignment mechanisms and monolithic interceptors are poorly matched to violations that span contextual islands. We propose Distributed Sentinel, a distributed zero-trust enforcement architecture that introduces the Semantic Taint Token (STT) Protocol. Through lightweight sidecar proxies, our system propagates security state across organizational boundaries without exposing raw cross-domain data, enabling Counterfactual Graph Simulation for cross-domain policy verification. We construct PhantomEcosystem, a comprehensive benchmark comprising 9 categories of…
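The abstract does not specify the STT format or the sidecar logic, so the following is only an illustrative sketch of the general idea: tokens carry policy labels rather than raw data, and a check at the boundary sees the combined labels that no single agent sees. The names `TaintToken`, `FORBIDDEN_COMBINATIONS`, `sidecar_check`, and the example labels are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaintToken:
    """Hypothetical token: carries policy labels only, never the underlying data."""
    labels: frozenset = frozenset()

    def merge(self, other: "TaintToken") -> "TaintToken":
        # When an agent combines inputs, the output inherits every label.
        return TaintToken(self.labels | other.labels)

# Hypothetical policy: label sets that must never reach an external sink together.
FORBIDDEN_COMBINATIONS = [
    frozenset({"hr:layoff_plan", "finance:unreleased_earnings"}),
]

def sidecar_check(token: TaintToken, sink_is_external: bool) -> bool:
    """Return True if the action is allowed under the assumed policy."""
    if not sink_is_external:
        return True
    # Block when any forbidden combination is fully contained in the token's labels.
    return not any(combo <= token.labels for combo in FORBIDDEN_COMBINATIONS)

# Each token is harmless alone; only the merged token trips the policy.
hr = TaintToken(frozenset({"hr:layoff_plan"}))
fin = TaintToken(frozenset({"finance:unreleased_earnings"}))
combined = hr.merge(fin)
```

In this sketch, an external send carrying only `hr` or only `fin` would pass, while one carrying `combined` would be blocked. That mirrors the context-fragmented pattern the abstract describes, where each step looks locally safe and only the accumulated state reveals the violation.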
