Adversarial Reasoning at Jailbreaking Time

Mahdi Sabbaghi; Paul Kassianik; George Pappas; Yaron Singer; Amin Karbasi; Hamed Hassani

arXiv:2502.01633 · cs.LG · June 26, 2025

TL;DR

This paper presents an adversarial reasoning method for automatically jailbreaking aligned large language models. The method achieves state-of-the-art attack success rates and exposes vulnerabilities that can inform more robust future systems.

Contribution

The paper introduces an adversarial reasoning approach that uses a loss signal to guide test-time compute when jailbreaking aligned LLMs, advancing the understanding of their vulnerabilities.

Findings

1. Achieves state-of-the-art attack success rates against aligned LLMs.
2. Demonstrates effectiveness of test-time compute in adversarial attacks.
3. Highlights new vulnerabilities in LLM alignment strategies.

Abstract

As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking that leverages a loss signal to guide the test-time compute, achieving SOTA attack success rates against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.

Code & Models

Repositories

helloworld10011/adversarial-reasoning (PyTorch, official)

Videos

Adversarial Reasoning at Jailbreaking Time · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning