Insider Attacks in Multi-Agent LLM Consensus Systems
Xiaolin Sun, Zixuan Liu, Yibin Hu, Zizhan Zheng

TL;DR
This paper studies insider attacks on multi-agent LLM consensus systems. It proposes a world-model-based reinforcement learning framework for optimizing a malicious agent's strategy and reports preliminary results supporting the attack's effectiveness.
Contribution
It introduces a novel framework combining latent world models and reinforcement learning to optimize insider attacks in multi-agent LLM systems.
Findings
The trained attacker reduces the rate at which benign agents reach consensus.
The trained attacker prolongs disagreement more effectively than the baseline.
Preliminary results support the effectiveness of the attack framework.
Abstract
Large language models (LLMs) are increasingly deployed in multi-agent systems where agents communicate in natural language to solve tasks jointly. A key capability in such systems is consensus formation, where agents iteratively exchange messages and update decisions to reach a shared outcome. However, most existing multi-agent LLM frameworks assume that all participating agents are aligned with the system objective. In practice, a malicious insider may participate as a legitimate member of the group while pursuing a hidden adversarial goal. In this work, we study insider manipulation in multi-agent LLM consensus systems. We formalize the problem as a sequential decision-making task in which a malicious agent seeks to delay or prevent agreement among benign agents. To make attack optimization tractable, we propose a world-model-based framework that learns surrogate dynamics over the…
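To make the consensus setting and the insider's objective concrete, here is a minimal toy sketch. It is not the paper's method: the benign agents are simple majority voters rather than LLMs, and the attacker uses a hand-written heuristic rather than a policy learned with a world model and reinforcement learning. All names (`benign_update`, `minority_backer`, `run_consensus`) are hypothetical and introduced only for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional

Policy = Callable[[List[int]], int]


def benign_update(own: int, observed: List[int]) -> int:
    """Toy benign rule: adopt the most common observed opinion.

    On a tie, an agent keeps its own opinion if it is among the leaders.
    """
    counts = Counter(observed)
    top = max(counts.values())
    leaders = [o for o, c in counts.items() if c == top]
    return own if own in leaders else min(leaders)


def minority_backer(opinions: List[int]) -> int:
    """Hand-written attacker heuristic (an assumption, not a learned policy):
    always voice the least common benign opinion to keep the group split."""
    counts = Counter(opinions)
    return min(counts, key=lambda o: (counts[o], o))


def run_consensus(
    benign_init: List[int],
    attacker: Optional[Policy] = None,
    max_rounds: int = 20,
) -> Optional[int]:
    """Return the round at which all benign agents agree, or None if
    they still disagree after max_rounds rounds of message exchange."""
    opinions = list(benign_init)
    for t in range(max_rounds):
        if len(set(opinions)) == 1:
            return t
        # Every agent broadcasts its opinion; the insider adds one message.
        messages = list(opinions)
        if attacker is not None:
            messages.append(attacker(opinions))
        opinions = [benign_update(o, messages) for o in opinions]
    return max_rounds if len(set(opinions)) == 1 else None


print(run_consensus([0, 0, 1]))                   # no insider
print(run_consensus([0, 0, 1], minority_backer))  # with insider
```

In this toy setup, three benign agents starting from opinions `[0, 0, 1]` agree after one round without an insider, while the minority-backing insider creates a persistent tie so that agreement is never reached within the round limit. A learned attacker, as described in the abstract, would replace the fixed heuristic with a policy optimized against surrogate dynamics.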
