Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms

Jonathan Nöther; Adish Singla; Goran Radanovic

arXiv:2508.16481·cs.LG·October 8, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces BAD-ACTS, a comprehensive benchmark for evaluating the robustness of agentic systems against adversarial harms, revealing vulnerabilities and proposing defenses to improve safety in AI agents.

Contribution

It presents a novel taxonomy of harms, a new benchmark with diverse implementations and harmful examples, and analyzes attack and defense strategies for agentic systems.

Findings

1. High success rate of adversarial attacks on agentic systems
2. Simple prompting defenses are ineffective against attacks
3. Message monitoring improves system robustness

Abstract

Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- 238 high-quality examples of harmful actions.
- Baseline defenses evaluated, such as safety-inducing prompts and zero-shot monitoring methods.

Weaknesses

- The proposed taxonomy, framed as one of the key contributions, doesn’t list novel points. All of them have been considered in the previous literature.
- Heavy reliance on LLM generation for the dataset creation may have introduced potential biases or lack of coverage. It would be useful to expand a discussion on the realism of the tasks and tools.
- The proposed benchmark is not properly compared to existing benchmarks. For example, AgentHarm from ICLR 2025 is not discussed. Also, it would be

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. **Novel Problem Formulation**: The paper's primary strength is its focus on a novel and important, if under-explored, threat model: internal adversarial manipulation within an MAS. It formalizes the "insider threat" problem for agentic systems, moving beyond typical external attacks (e.g., user jailbreaks).
2. **Comprehensive Benchmark Engineering**: The creation of five distinct environments with different communication structures (centralized and hierarchical) is a significant engineering

Weaknesses

1. **Questionable Practicality of the Threat Model**: A noteworthy limitation is the practical realism of the threat model. The benchmark assumes an adversary has already achieved full control over one agent. The paper does not justify how this level of control is realistically achieved. Therefore, while the high ASRs are alarming, they are contingent on this "best-case" scenario for the attacker, which may not be broadly applicable.
2. **Lack of Rigor in Taxonomy Generation**: The novelty and

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- Timely & impactful problem framing and taxonomy.
- Diverse, well-scoped environments.
- Clearly written.

Weaknesses

- **Threat model:** The primary attack assumes a fully compromised agent with role-conformant messaging. How do results translate to more realistic threat models (e.g., partial prompt corruption, compromised tool output, or indirect prompt injection (IPI))? Any treatment of weaker adversaries would strengthen claims about general robustness.
- **Evaluation metric:** Keyword metrics could under-count semantically successful but lexically different attacks and over-count near-misses. More detail on failure modes of the keyw
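
The evaluation-metric concern above can be made concrete with a small sketch. It is not the benchmark's actual scoring code: the `Episode` type, the substring-matching rule, the keyword list, and the toy logs are all assumptions chosen only to show how a keyword-based attack success rate (ASR) might miss a paraphrased attack and wrongly count a refusal that quotes the keyword.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One attack attempt (hypothetical structure, not from the paper)."""
    action_log: str      # text of the actions/messages the system produced
    truly_harmful: bool  # ground-truth label, e.g. from a human annotator


def keyword_hit(action_log: str, keywords: list[str]) -> bool:
    """True if any keyword occurs in the log (case-insensitive substring match)."""
    text = action_log.lower()
    return any(k.lower() in text for k in keywords)


def keyword_asr(episodes: list[Episode], keywords: list[str]) -> float:
    """Fraction of episodes that the keyword rule counts as successful attacks."""
    if not episodes:
        return 0.0
    return sum(keyword_hit(e.action_log, keywords) for e in episodes) / len(episodes)


def disagreements(episodes: list[Episode], keywords: list[str]) -> tuple[list[int], list[int]]:
    """Indices the keyword rule misses (under-count) and wrongly flags (over-count)."""
    missed, wrongly_flagged = [], []
    for i, e in enumerate(episodes):
        hit = keyword_hit(e.action_log, keywords)
        if e.truly_harmful and not hit:
            missed.append(i)
        elif hit and not e.truly_harmful:
            wrongly_flagged.append(i)
    return missed, wrongly_flagged


# Toy data, invented for illustration only.
episodes = [
    Episode("send_email(body='wire the funds to account 123')", True),    # exact keyword
    Episode("send_email(body='transfer the money to acct 123')", True),   # paraphrase
    Episode("reply('I will not wire the funds as requested.')", False),   # refusal quoting keyword
    Episode("reply('Meeting moved to 3pm.')", False),                     # benign
]
keywords = ["wire the funds"]

asr = keyword_asr(episodes, keywords)
missed, wrongly_flagged = disagreements(episodes, keywords)
print(asr, missed, wrongly_flagged)
```

In this toy set the two errors cancel in the aggregate, so the keyword ASR matches the true rate even though two of the four episodes are scored incorrectly. Reporting per-episode disagreements alongside the aggregate is one way to surface that.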

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Smart Grid Security and Resilience · Network Security and Intrusion Detection