Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting

Nirmit Arora; Sathvik Joel; Ishan Kavathekar; Palak; Rohan Gandhi; Yash Pandya; Tanuja Ganu; Aditya Kanade; Akshay Nambi

arXiv:2511.10949·cs.MA·November 17, 2025

Open Access 3 Reviews

TL;DR

This paper introduces SafeAgents, a framework for assessing security vulnerabilities in multi-agent systems against adversarial prompts, revealing that common design patterns can be significantly vulnerable and emphasizing the need for security-aware design.

Contribution

SafeAgents provides a unified, extensible framework and a diagnostic measure, Dharma, to systematically evaluate multi-agent system security and identify its weak links.

Findings

01. Centralized systems are more vulnerable to adversarial prompts.

02. Design choices like plan strategies impact system robustness.

03. Vulnerabilities vary across different architectures and datasets.

Abstract

LLM-based agents are increasingly deployed in multi-agent systems (MAS). As these systems move toward real-world applications, their security becomes paramount. Existing research largely evaluates single-agent security, leaving a critical gap in understanding the vulnerabilities introduced by multi-agent design. However, existing systems fall short due to lack of unified frameworks and metrics focusing on unique rejection modes in MAS. We present SafeAgents, a unified and extensible framework for fine-grained security assessment of MAS. SafeAgents systematically exposes how design choices such as plan construction strategies, inter-agent context sharing, and fallback behaviors affect susceptibility to adversarial prompting. We introduce Dharma, a diagnostic measure that helps identify weak links within multi-agent pipelines. Using SafeAgents, we conduct a comprehensive study across five…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper tackles a vital and forward-looking research gap. As the community shifts from single-agent systems to more complex multi-agent collaborations, understanding their unique security profile is paramount. The authors correctly argue that single-agent safety alignment does not guarantee MAS safety, providing a strong motivation for their work. The paper delivers non-obvious and important insights. The central finding—that low sub-agent autonomy (atomic delegation) is a critical weak link—

Weaknesses

1. The DHARMA metric itself is well-designed, but its implementation relies on an LLM-as-judge to classify trajectories. This introduces a significant potential source of error and non-determinism. The paper provides the (very long) prompts used for classification but offers no validation of the judge's accuracy. A misclassification by the judge (e.g., labeling a Sub-agent-Ignored as Unmitigated Execution) could materially skew the core results in Table 2. The reliability of the paper's central

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Addresses a timely and underexplored problem of security in multi-agent LLMs.
2. Broad empirical coverage across multiple architectures, datasets, and models.

Weaknesses

1. The paper has limited technical contribution. The “weak links” analysis and DHARMA classes are mostly heuristic categorizations without strong theoretical grounding.
2. The paper has limited novelty. The contributions are primarily in system design and taxonomy rather than algorithmic innovation.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Conceptual move from “did it refuse?” to “where did it fail?” is overdue and useful in practice.
2. DHARMA’s top-level split (planner vs sub-agent) is intuitive and maps to actionable design knobs.
3. SAFEAGENTS abstraction could enable apples-to-apples comparisons across frameworks instead of the usual benchmark bingo.
4. It shows centralized planners can spread bad plans efficiently when guardrails wobble.

Weaknesses

1. The DHARMA taxonomy contains internal inconsistencies that make categories non-exclusive. In particular, the leaf describing planner failure simultaneously references the absence of a valid plan and the continuation “despite a valid plan,” which is self-contradictory.
2. The paper relies on a single LLM-as-judge to assign DHARMA labels, yet it provides no human adjudication study, no second judge model, and no error bars. Because the core claims depend on these fine-grained labels, the absen
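The judge-validation concern raised in the reviews can be made concrete with an agreement check. The following is a minimal sketch and not something taken from the paper: it assumes a small sample of trajectories has been labeled both by a human annotator and by the LLM judge, and it computes Cohen's kappa between the two label sequences. The function name `cohens_kappa` and the sample labels are hypothetical; only the two category names come from the review text above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight trajectories (illustrative data, not paper results).
U = "Unmitigated Execution"
S = "Sub-agent-Ignored"
human = [U, U, S, S, U, S, U, S]
judge = [U, U, S, U, U, S, U, S]

kappa = cohens_kappa(human, judge)
print(f"Cohen's kappa between human and judge labels: {kappa:.2f}")
```

A kappa near 1 would suggest the judge's labels track human judgment closely, while a low value would indicate that results built on those labels need closer scrutiny.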

Taxonomy

Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Information and Cyber Security