David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning
Samuel Nellessen, Tal Kachman

TL;DR
This paper introduces a formal threat model called Tag-Along Attacks for autonomous language model agents and presents Slingshot, a reinforcement learning framework that autonomously discovers effective, transferable jailbreaking attack vectors.
Contribution
The paper formalizes Tag-Along Attacks as a verifiable threat model and develops Slingshot, a reinforcement learning method that autonomously finds transferable agent jailbreaking strategies.
Findings
On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator.
Attacks transfer zero-shot to various models, including closed-source and fine-tuned open models.
Learned attacks tend to be short, instruction-like patterns rather than multi-turn persuasion.
Abstract
The evolution of large language models into autonomous agents introduces adversarial failures that exploit legitimate tool privileges, transforming safety evaluation in tool-augmented environments from a subjective NLP task into an objective control problem. We formalize this threat model as Tag-Along Attacks: a scenario where a tool-less adversary "tags along" on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone. To validate this threat, we present Slingshot, a 'cold-start' reinforcement learning framework that autonomously discovers emergent attack vectors, revealing a critical insight: in our setting, learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ…
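The abstract's claim that safety evaluation becomes "objective" can be illustrated with a small sketch: success is read off the Operator's tool-call trace rather than judged from its text. This is a minimal illustration and not the paper's implementation. The `ToolCall` schema, the function names, the tool names, and the binary success criterion are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToolCall:
    """One tool invocation made by the Operator (hypothetical schema)."""
    name: str
    arguments: Dict[str, str] = field(default_factory=dict)


def attack_succeeded(trace: List[ToolCall], prohibited: Set[str]) -> bool:
    """Return True if the Operator invoked any prohibited tool.

    Assumption: an episode counts as a success when at least one
    prohibited tool name appears in the trace. The adversary has no
    tools of its own, so only the Operator's calls are inspected.
    """
    return any(call.name in prohibited for call in trace)


def success_rate(episodes: List[List[ToolCall]], prohibited: Set[str]) -> float:
    """Fraction of episodes in which a prohibited tool was called."""
    if not episodes:
        return 0.0
    wins = sum(attack_succeeded(trace, prohibited) for trace in episodes)
    return wins / len(episodes)


# Illustrative traces; the tool names are made up for this example.
PROHIBITED = {"delete_account"}
EPISODES = [
    [ToolCall("search")],
    [ToolCall("search"), ToolCall("delete_account", {"user": "alice"})],
    [],
]
RATE = success_rate(EPISODES, PROHIBITED)
```

Because the check depends only on which tools were called, it needs no human or model-based judge, which is one plausible reading of what "verifiable" means in this setting.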
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
