TAMAS: Benchmarking Adversarial Risks in Multi-Agent LLM Systems
Ishan Kavathekar, Hemang Jain, Ameya Rathod, Ponnurangam Kumaraguru, Tanuja Ganu

TL;DR
TAMAS is a comprehensive benchmark designed to evaluate the robustness and safety of multi-agent LLM systems against adversarial threats, revealing significant vulnerabilities and guiding future defenses.
Contribution
This paper introduces TAMAS, the first benchmark specifically targeting adversarial risks in multi-agent LLM systems, including diverse scenarios, attack types, and a new robustness score.
Findings
Multi-agent systems are highly vulnerable to adversarial attacks.
Current frameworks show critical failure modes under adversarial conditions.
TAMAS provides a systematic way to study and improve multi-agent LLM safety.
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents through tool use, planning, and decision-making abilities, leading to their widespread adoption across diverse tasks. As task complexity grows, multi-agent LLM systems are increasingly used to solve problems collaboratively. However, the safety and security of these systems remain largely under-explored. Existing benchmarks and datasets predominantly focus on single-agent settings, failing to capture the unique vulnerabilities of multi-agent dynamics and coordination. To address this gap, we introduce Threats and Attacks in Multi-Agent Systems (TAMAS), a benchmark designed to evaluate the robustness and safety of multi-agent LLM systems. TAMAS includes five distinct scenarios comprising 300 adversarial instances across six attack types…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The idea is interesting.
- The problem studied, adversarial vulnerabilities in multi-agent systems, is interesting.
- The paper's claims to originality are overstated; it fails to properly clarify what is fundamentally new about its evaluation of adversarial attacks compared with attacks in the LLM or single-agent context.
- The tasks and tools are partly synthetic, so it is unclear how well results transfer to real systems with live APIs and true side effects.
- The quality of the benchmark execution is also questionable; the dataset of 300 adversarial instances seems small for the scope of the
1. The topic of assessing multi-agent system safety is timely and important.
2. The benchmark includes multiple tasks, multiple constructed prompts, and a corresponding metric. The evaluation also covers multiple agentic structures.
1. Lack of comparison with other agent safety benchmarks [1,2,3]: what are the differences and the main contribution compared to these benchmarks? [1] AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. [2] Agent-SafetyBench: Evaluating the Safety of LLM Agents. [3] Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents.
2. Lack of attack scenarios. For example, the MAS jailbreak [4], malicious coding behavior [5], and agent conversation behavior
1. Originality and Significance: to the best of my knowledge, this is the first benchmark to evaluate the safety and robustness of multi-agent LLM systems, especially for >= 3 agents. Some attacks specific to MAS, such as Byzantine, Colluding, and Contradicting agents, are also tested. The topic is important and of interest to the community, as it addresses under-explored and systemic risks of MAS.
2. The benchmark provides extensive evaluation, spanning five domains, 3
Weakness 1: Limited Scope of Adversarial Goals (Disruption vs. Misuse) A limitation of the benchmark is its focus on attacks that disrupt a given task (e.g., Byzantine, Contradicting agents) or manipulate the immediate output (e.g., prompt injection), rather than testing for more severe, exploitative misuse. The safety community is increasingly concerned with threat actors instrumentalizing systems for inherently harmful, multi-step goals. The current benchmark does not appear to evaluate scena
1. The paper introduces TAMAS, the first benchmark to systematically evaluate the safety of multi-agent LLM systems. Its key innovation is defining and testing "multi-agent-specific risks" (like Byzantine, Colluding, and Contradicting agents), which "have no analog in single-agent setups".
2. The work is methodologically rigorous. The TAMAS benchmark is comprehensive, spanning 300 adversarial instances across five domains and six attack types. The evaluation is thorough, testing 10 LLM backbon
The core weakness is the paper's focus on demonstrating failure without providing actionable steps for mitigation or a deep root cause analysis.
* Missing Defenses: The paper does not test the effectiveness of simple, common defenses, like providing agents with explicit refusal instructions (safety guardrails) in their prompts, which limits its practical use.
* Shallow Analysis: It needs a deeper root cause analysis to distinguish whether failures are due to: Model-Level Compliance (LLM ignori
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
