When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
Lingxi Zhang, Guangtao Zheng, Hanjie Chen

TL;DR
This paper examines the limitations of embedding-based defenses in LLM-based multi-agent systems and proposes confidence scores as a more robust alternative for detecting malicious communication.
Contribution
It provides a theoretical analysis of embedding-based defense failures and introduces confidence scores to improve robustness against sophisticated attacks.
Findings
Embedding-based defenses can be circumvented by crafted messages with similar embeddings.
Token-level confidence scores can help detect malicious messages when their embeddings are indistinguishable from benign ones.
Using confidence scores improves robustness across different models, datasets, and communication structures.
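The findings above can be illustrated with a toy sketch. It is not the authors' implementation: the vectors, log-probabilities, thresholds, and function names (`embedding_flag`, `confidence`, `confidence_flag`) are all hypothetical. Confidence is assumed here to be the mean per-token probability, and the crafted message is assumed to be generated with lower confidence. The page does not state how the paper defines or uses the score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def embedding_flag(embedding, benign_centroid, min_similarity=0.9):
    """Flag a message whose embedding is far from the benign centroid.
    The threshold is an illustrative assumption."""
    return cosine(embedding, benign_centroid) < min_similarity

def confidence(token_logprobs):
    """Mean per-token probability (assumed definition of confidence)."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def confidence_flag(token_logprobs, min_confidence=0.6):
    """Flag a message generated with low average token confidence.
    The threshold and direction are illustrative assumptions."""
    return confidence(token_logprobs) < min_confidence

# Toy data: the crafted message sits close to the benign centroid in
# embedding space, but its tokens were produced with low probability.
benign_centroid = [1.0, 0.0, 0.0]
crafted_embedding = [0.98, 0.10, 0.05]
benign_logprobs = [-0.05, -0.10, -0.20]
crafted_logprobs = [-1.2, -0.9, -1.6]

missed_by_embedding = not embedding_flag(crafted_embedding, benign_centroid)
caught_by_confidence = confidence_flag(crafted_logprobs)
benign_passes = not confidence_flag(benign_logprobs)
```

In this toy setup the embedding check misses the crafted message because its cosine similarity to the benign centroid stays above the threshold, while the confidence check still separates it from the benign message.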
Abstract
Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks: Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on the text embeddings, they…
