David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Samuel Nellessen; Tal Kachman

arXiv:2602.02395 · cs.LG · February 3, 2026

Open Access

TL;DR

This paper introduces a formal threat model called Tag-Along Attacks for autonomous language model agents and presents Slingshot, a reinforcement learning framework that autonomously discovers effective, transferable jailbreaking attack vectors.

Contribution

The paper formalizes Tag-Along Attacks as a verifiable threat model and develops Slingshot, a reinforcement learning method that autonomously finds transferable agent jailbreaking strategies.

Findings

01. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator.

02. Attacks transfer zero-shot to other models, including closed-source models and fine-tuned open models.

03. Learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion.

Abstract

The evolution of large language models into autonomous agents introduces adversarial failures that exploit legitimate tool privileges, transforming safety evaluation in tool-augmented environments from a subjective NLP task into an objective control problem. We formalize this threat model as Tag-Along Attacks: a scenario where a tool-less adversary "tags along" on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone. To validate this threat, we present Slingshot, a 'cold-start' reinforcement learning framework that autonomously discovers emergent attack vectors, revealing a critical insight: in our setting, learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ…
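The abstract frames success as something that can be checked objectively: the attack counts only if the Operator actually performs a prohibited tool call. A minimal sketch of such a check follows. It is not the authors' implementation; the names `ToolCall` and `attack_succeeded`, the trace format, and the example tool names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation made by the Operator (hypothetical record format)."""
    name: str
    args: tuple = ()


def attack_succeeded(trace, prohibited):
    """Return True if any tool call in the Operator's trace is prohibited.

    The verdict depends only on the executed tool calls, not on the wording
    of the conversation, which is what makes the check objective.
    """
    return any(call.name in prohibited for call in trace)


# Hypothetical episode: the Operator reads a file, then deletes it.
trace = [
    ToolCall("read_file", ("notes.txt",)),
    ToolCall("delete_file", ("notes.txt",)),
]
prohibited = {"delete_file"}  # assumed policy for this illustration

print(attack_succeeded(trace, prohibited))
```

Because the check returns a plain boolean over the tool trace, it could in principle serve as a binary reward signal for a reinforcement learning attacker, though how Slingshot actually shapes its reward is not stated in the text above.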

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI