Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Xiangwen Wang; Ananth Balashankar; Varun Chandrasekaran

arXiv:2603.11149·cs.LG·March 20, 2026

TL;DR

This paper develops a systematic framework for analyzing how the success of jailbreak attacks on large language models scales with attacker effort, measured in FLOPs, and uses it to compare attack efficiency across methods, model families, and harm types.

Contribution

The paper introduces a scaling-law framework for jailbreak attacks, compares four attack paradigms on a shared FLOPs axis, and reports empirical findings on their efficiency and effectiveness across models and harm types.

Findings

01. Prompt-based attacks are more compute-efficient than optimization-based methods.

02. Prompt-based attacks optimize more effectively in prompt space.

03. Vulnerability varies significantly with attack goals, especially for misinformation.

Abstract

Large language models remain vulnerable to jailbreak attacks, yet we still lack a systematic understanding of how jailbreak success scales with attacker effort across methods, model families, and harm types. We initiate a scaling-law framework for jailbreaks by treating each attack as a compute-bounded optimization procedure and measuring progress on a shared FLOPs axis. Our systematic evaluation spans four representative jailbreak paradigms, covering optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization, across multiple model families and scales on a diverse set of harmful goals. We investigate scaling laws that relate attacker budget to attack success score by fitting a simple saturating exponential function to FLOPs–success trajectories, and we derive comparable efficiency summaries from the fitted curves. Empirically,…
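
The curve-fitting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact functional form, so the code below assumes s(c) = s_max * (1 - exp(-c / tau)), where c is attacker FLOPs, s_max is the asymptotic success score, and tau is a compute scale. The function names, the grid-search fitting procedure, the "FLOPs to reach half of the asymptote" summary, and the synthetic data are all illustrative assumptions, not the authors' implementation.

```python
import math


def saturating_exp(c, s_max, tau):
    """Assumed form (not taken from the paper): s(c) = s_max * (1 - exp(-c / tau))."""
    return s_max * (1.0 - math.exp(-c / tau))


def fit_saturating_exp(flops, scores, n_grid=400):
    """Least-squares fit of (s_max, tau).

    For a fixed tau the model is linear in s_max, so s_max has a closed-form
    least-squares solution; tau is found by a log-spaced grid search.
    """
    lo = min(c for c in flops if c > 0) / 10.0
    hi = max(flops) * 10.0
    best = None
    for i in range(n_grid):
        tau = lo * (hi / lo) ** (i / (n_grid - 1))
        g = [1.0 - math.exp(-c / tau) for c in flops]
        denom = sum(v * v for v in g)
        if denom == 0.0:
            continue
        s_max = sum(y * v for y, v in zip(scores, g)) / denom
        sse = sum((y - s_max * v) ** 2 for y, v in zip(scores, g))
        if best is None or sse < best[0]:
            best = (sse, s_max, tau)
    _, s_max, tau = best
    return s_max, tau


def flops_to_fraction(tau, fraction=0.5):
    """Hypothetical efficiency summary: FLOPs needed to reach `fraction` of s_max."""
    return -tau * math.log(1.0 - fraction)


# Synthetic, noise-free trajectory for illustration only (not data from the paper).
true_s_max, true_tau = 0.8, 2e15
flops = [1e14 * 2 ** k for k in range(10)]
scores = [saturating_exp(c, true_s_max, true_tau) for c in flops]

s_max_hat, tau_hat = fit_saturating_exp(flops, scores)
half_budget = flops_to_fraction(tau_hat, 0.5)
print(f"s_max={s_max_hat:.3f}, tau={tau_hat:.3e}, FLOPs to half of s_max={half_budget:.3e}")
```

Under this assumed form, two attacks can be compared on the same FLOPs axis through their fitted s_max (how high success saturates) and tau (how quickly it gets there). A library optimizer such as a nonlinear least-squares routine would be a natural replacement for the grid search when fitting real, noisy trajectories.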

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Misinformation and Its Impacts