Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents
Julia Bazinska, Max Mathys, Francesco Casucci, Mateo Rojas-Carulla, Xander Davies, Alexandra Souly, Niklas Pfister

TL;DR
This paper introduces a systematic framework called threat snapshots to evaluate the security of backbone LLMs in AI agents, revealing key insights about vulnerabilities and guiding security improvements.
Contribution
It presents the threat snapshots framework and the $b^3$ benchmark for assessing LLM security, addressing limitations of prior methods and enabling comprehensive vulnerability analysis.
Findings
Enhanced reasoning improves LLM security.
Model size does not correlate with security.
The $b^3$ benchmark includes 194,331 adversarial attacks.
Abstract
AI agents powered by large language models (LLMs) are being deployed at scale, yet we lack a systematic understanding of how the choice of backbone LLM affects agent security. The non-deterministic sequential nature of AI agents complicates security modeling, while the integration of traditional software with AI components entangles novel LLM vulnerabilities with conventional security risks. Existing frameworks only partially address these challenges as they either capture specific vulnerabilities only or require modeling of complete agents. To address these limitations, we introduce threat snapshots: a framework that isolates specific states in an agent's execution flow where LLM vulnerabilities manifest, enabling the systematic identification and categorization of security risks that propagate from the LLM to the agent level. We apply this framework to construct the benchmark, a…
Peer Reviews
Decision·ICLR 2026 Poster
1. The Threat Snapshot formalism is a major conceptual contribution that elegantly decouples LLM-specific vulnerabilities from agentic system context, enabling generalizable benchmarking and red teaming. 2. The benchmark (79K attacks, 27 models) is very comprehensive. It is valuable for both researchers and practitioners.
The threat-snapshot abstraction focuses on isolated LLM calls and single-backbone behavior, but the paper does not convincingly show that these snapshots still isolate backbone vulnerabilities when execution flows interleave multiple LLMs or long multi-turn interactions. In real agent deployments, control flow, state handoffs, and interaction between multiple models can create emergent attack surfaces (e.g., prompt-infection cascading across agents) that a single-call snapshot may miss. This lea
1. **Novel and Practical Framework**: The threat snapshot concept is a clear and effective abstraction. It intelligently decomposes the complex problem of "agent security" into a more manageable one: evaluating the backbone LLM's security at a specific, contextualized state. This approach greatly simplifies the evaluation process. 2. **Massive, High-Quality Benchmark ($b^3$)**: The core contribution is the $b^3$ benchmark. The dataset of nearly 80,000 human-generated adversarial attacks, collec
1. **Scope Limited to Single-Agent Scenarios**: While the paper mentions that the framework could apply to Multi-Agent Systems (MAS), the 10 threat snapshots (Table 2) all focus exclusively on single-agent contexts. The evaluation misses key MAS-specific security risks, such as inter-agent deception, manipulation, or collusion. 2. **Inability to Capture Long-Horizon Attacks**: The "threat snapshot" method, by design, evaluates security at a single point in time. This makes it difficult to captur
1. Proposes a novel threat snapshot framework to systematically analyze LLM security in agents. 1. Builds the b³ benchmark with diverse, realistic attack scenarios and defense levels. 1. Uses crowdsourced red-teaming to collect over 79,000 real attack prompts. 1. Reveals that reasoning modes improve robustness, while size doesn’t guarantee safety. 1. Offers clear practical insights for designing safer AI agents.
1. The paper does not evaluate model utility or performance trade-offs, making it hard to identify backbones that balance security and capability. 1. The tested defenses are limited, excluding some sota defense methods. 1. The threat snapshot abstraction only captures single-step interactions, overlooking multi-turn or long-horizon attacks that occur in real agents.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
