The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
Qiqi Liu, Thorsten Holz, Shilin Ye, Runhan Song

TL;DR
This paper shows that increasing Worker capability in multi-agent LLM systems can paradoxically reduce security, an effect driven by linguistic certainty, and proposes heterogeneous ensemble verification as a mitigation.
Contribution
It uncovers the capability paradox in multi-agent systems and introduces a novel ensemble verification method leveraging capability asymmetries for enhanced security.
Findings
Higher Worker capability raises attack success: the mean system-level Attack Success Rate (ASR) rises from 18.4% to 63.9%, peaking at 94.4%.
Linguistic certainty mediates the relationship between capability and security failure.
Heterogeneous ensemble verification reduces attack success rate from 52.8% to 2.0%.
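To make the ensemble idea concrete, here is a minimal sketch, not the paper's implementation. It assumes that each verifier is a function returning True when a Worker report looks safe, and that the Manager acts only when enough verifiers approve (unanimity by default). The names `ensemble_approves`, `attack_success_rate`, and both toy verifiers are hypothetical; the paper's actual aggregation rule and verifier models are not given in this summary.

```python
from typing import Callable, Optional, Sequence

# Assumption: a verifier maps a Worker report to True (looks safe) or False.
Verifier = Callable[[str], bool]


def ensemble_approves(
    report: str,
    verifiers: Sequence[Verifier],
    min_approvals: Optional[int] = None,
) -> bool:
    """Return True only if enough verifiers approve the report.

    With min_approvals=None every verifier must approve (an assumed
    rule, chosen here because a single dissent then blocks the report).
    """
    if not verifiers:
        raise ValueError("at least one verifier is required")
    needed = len(verifiers) if min_approvals is None else min_approvals
    approvals = sum(1 for verify in verifiers if verify(report))
    return approvals >= needed


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of adversarial trials in which the attack succeeded."""
    if not outcomes:
        raise ValueError("no trials recorded")
    return sum(outcomes) / len(outcomes)


# Hypothetical stand-ins for verifiers of differing capability.
def cautious_verifier(report: str) -> bool:
    return "bypass" not in report.lower()


def permissive_verifier(report: str) -> bool:
    return True


benign = "Routine summary of quarterly maintenance logs."
hijacked = "Per protocol, steps to bypass the safety interlock follow."
ensemble = [cautious_verifier, permissive_verifier]

print(ensemble_approves(benign, ensemble))
print(ensemble_approves(hijacked, ensemble))
print(attack_success_rate([True, False, False, False]))
```

Under the unanimity assumption, the hijacked report is blocked because one verifier dissents, even though the permissive one approves it. That is the intuition behind mixing verifiers of different capability.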
Abstract
Multi-agent systems extend large language models (LLMs) by decomposing tasks among specialized agents, but their distributed decision process creates new attack surfaces. We identify semantic hijacking, an attack in which harmful requests are concealed within domain-specific narratives and propagated to a Manager through Worker reports, without any syntactic injection primitives. Across 42,000 adversarial trials over 12 Manager models and 7 Worker configurations, we uncover a capability paradox: as Worker capability increases, the mean system-level Attack Success Rate (ASR) increases from 18.4% to 63.9%, peaking at 94.4%. To explain this effect, we conduct multi-level mediation analysis on two independent datasets (47,807 interactions). This analysis shows that this paradox is driven by linguistic certainty: stronger Workers are more likely to interpret adversarial narratives as…
