TL;DR
This paper introduces NARCBench, a benchmark for detecting multi-agent collusion in language models, and proposes five probing techniques that identify collusion signatures across four open-weight models and a range of scenarios.
Contribution
It presents a new benchmark and probing methods for multi-agent collusion detection, extending interpretability from single models to multi-agent systems.
Findings
All models achieved perfect AUROC (1.0) in-distribution.
Under zero-shot transfer, the probing techniques achieved AUROC between 0.73 and 0.93.
Detection performance improved with model capability.
Abstract
As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level, evaluated across four open-weight models (Qwen3-32B, Llama-3.1-70B, DeepSeek-R1 32B, GPT-OSS-20B) and six probe architectures. We frame this as a distributed anomaly detection problem, identifying three collusion signatures that map onto distinct…
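The group-level classification step described in the abstract can be illustrated with a minimal sketch. It assumes each agent already has a scalar deception score from a probe and shows how such scores might be pooled into one scenario-level score and evaluated with AUROC. The three aggregators, the synthetic score distributions, and all function names here are hypothetical illustrations; they are not the paper's five techniques or its data.

```python
import random
from statistics import mean


def aggregate_max(scores):
    """Scenario score = highest per-agent score (hypothetical aggregator)."""
    return max(scores)


def aggregate_mean(scores):
    """Scenario score = average per-agent score (hypothetical aggregator)."""
    return mean(scores)


def aggregate_top2_mean(scores):
    """Scenario score = mean of the two highest per-agent scores (hypothetical)."""
    return mean(sorted(scores, reverse=True)[:2])


def auroc(pos, neg):
    """Probability that a random positive outranks a random negative; ties count half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def simulate_scenario(rng, n_agents=4, n_colluders=0):
    """Synthetic per-agent probe scores; colluders are shifted upward (assumption)."""
    return [
        rng.gauss(1.5 if i < n_colluders else 0.0, 1.0)
        for i in range(n_agents)
    ]


rng = random.Random(0)
collusive = [simulate_scenario(rng, n_colluders=2) for _ in range(200)]
benign = [simulate_scenario(rng) for _ in range(200)]

aggregators = {
    "max": aggregate_max,
    "mean": aggregate_mean,
    "top2_mean": aggregate_top2_mean,
}

results = {}
for name, agg in aggregators.items():
    # Pool per-agent scores into one score per scenario, then rank scenarios.
    results[name] = auroc(
        [agg(s) for s in collusive],
        [agg(s) for s in benign],
    )
    print(f"{name}: AUROC = {results[name]:.3f}")
```

In this toy setup the aggregators differ only in how much weight they give to the highest-scoring agents, which is one plausible way that different pooling rules could respond to different collusion patterns.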
