Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

Aaron Rose; Carissa Cullen; Sahar Abdelnabi; Philip Torr; Brandon Gary Kaplowitz; Christian Schroeder de Witt

arXiv:2604.01151·cs.AI·May 12, 2026

TL;DR

This paper introduces NARCBench, a benchmark for evaluating collusion detection among LLM agents under environment distribution shift, and proposes five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level.

Contribution

It presents a new benchmark and probing methods for multi-agent collusion detection, extending interpretability from single models to multi-agent systems.

Findings

01 All models achieved perfect AUROC in-distribution.

02 Under zero-shot transfer, the probing techniques achieved AUROC between 0.73 and 0.93.

03 Detection performance improved with model capability.

Abstract

As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level, evaluated across four open-weight models (Qwen3-32B, Llama-3.1-70B, DeepSeek-R1 32B, GPT-OSS-20B) and six probe architectures. We frame this as a distributed anomaly detection problem, identifying three collusion signatures that map onto distinct…
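The idea of aggregating per-agent deception scores into a group-level decision can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the aggregation functions (`aggregate_max`, `aggregate_mean`, `aggregate_gap`) are generic choices and not necessarily any of the paper's five techniques, the per-agent scores are made-up numbers standing in for the output of a linear probe on activations, and `auroc` is a plain pairwise-ranking implementation of the standard metric.

```python
from statistics import mean
from typing import Callable, Sequence


def aggregate_max(scores: Sequence[float]) -> float:
    """Group score = the most suspicious single agent."""
    return max(scores)


def aggregate_mean(scores: Sequence[float]) -> float:
    """Group score = average suspicion across all agents."""
    return mean(scores)


def aggregate_gap(scores: Sequence[float]) -> float:
    """Group score = how far the top agent stands out from the group mean."""
    return max(scores) - mean(scores)


def auroc(labels: Sequence[int], values: Sequence[float]) -> float:
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [v for l, v in zip(labels, values) if l == 1]
    neg = [v for l, v in zip(labels, values) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(
    scenarios: Sequence[tuple[Sequence[float], int]],
    aggregate: Callable[[Sequence[float]], float],
) -> float:
    """Collapse each scenario's per-agent scores to one number, then score with AUROC."""
    labels = [label for _, label in scenarios]
    group_scores = [aggregate(agent_scores) for agent_scores, _ in scenarios]
    return auroc(labels, group_scores)


# Hypothetical toy data: (per-agent deception scores, 1 = collusion scenario).
TOY_SCENARIOS = [
    ([0.10, 0.20, 0.15, 0.10], 0),
    ([0.20, 0.10, 0.30, 0.20], 0),
    ([0.90, 0.80, 0.20, 0.10], 1),
    ([0.70, 0.85, 0.10, 0.20], 1),
]

if __name__ == "__main__":
    for fn in (aggregate_max, aggregate_mean, aggregate_gap):
        print(fn.__name__, evaluate(TOY_SCENARIOS, fn))
```

Different aggregators respond to different patterns: a maximum reacts to a single high-scoring agent, while a gap-style statistic reacts to a subset of agents standing apart from the rest of the group. Which of these, if any, corresponds to the paper's techniques is not stated in the truncated abstract above.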

Code & Models

Repositories

aaronrose227/narcbench (GitHub)