When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

Lingxi Zhang; Guangtao Zheng; Hanjie Chen

arXiv:2605.01133·cs.CR·May 5, 2026

TL;DR

This paper examines the limitations of embedding-based defenses in LLM-based multi-agent systems and proposes token-level confidence scores as a more robust signal for detecting malicious communication.

Contribution

The paper provides a theoretical analysis of why embedding-based defenses fail and introduces confidence scores to improve robustness against attacks crafted to evade them.

Findings

01

Embedding-based defenses can be circumvented by crafted messages whose embeddings lie close to those of benign messages.

02

Token-level confidence scores can help detect malicious messages when their embeddings are indistinguishable from benign ones.

03

Using confidence scores improves robustness across different models, datasets, and communication structures.
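
The contrast between the first two findings can be illustrated with a minimal sketch. It is not the paper's method: the function names, thresholds, toy vectors, and log-probabilities below are all hypothetical, and the direction of the confidence signal (malicious messages assumed to be generated with lower token confidence) is an assumption made only for illustration. The sketch shows a crafted message that passes a cosine-similarity check against a benign centroid, yet is flagged by a mean token-confidence check.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def embedding_flag(message_vec, benign_centroid, min_similarity=0.8):
    """Hypothetical embedding-based check: flag a message whose embedding
    is far from the centroid of benign messages."""
    return cosine(message_vec, benign_centroid) < min_similarity

def mean_token_confidence(token_logprobs):
    """Average per-token probability, exp(logprob), over one message."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def confidence_flag(token_logprobs, min_confidence=0.6):
    """Hypothetical confidence-based check. ASSUMPTION: a low mean token
    probability is treated as suspicious; the paper may score differently."""
    return mean_token_confidence(token_logprobs) < min_confidence

# Toy data (invented for illustration, not taken from the paper).
benign_centroid = [1.0, 0.0]
crafted_vec = [0.95, 0.10]              # embedding close to the benign centroid
crafted_logprobs = [-1.2, -0.9, -1.6]   # low-confidence generation
benign_logprobs = [-0.1, -0.2, -0.05]   # high-confidence generation

evades_embedding_check = not embedding_flag(crafted_vec, benign_centroid)
caught_by_confidence = confidence_flag(crafted_logprobs)
benign_passes = not confidence_flag(benign_logprobs)
```

In this toy setup the crafted message evades the embedding check because its vector is nearly parallel to the benign centroid, while the confidence check separates it from the benign message. Real thresholds would need to be calibrated per model and dataset.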

Abstract

Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks: Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on the text embeddings, they…
