Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

TL;DR
This paper investigates how the success rate of attacks on large language models scales with the number of inference-time samples: polynomially under weak or no adversarial prompt injection, and exponentially under strong injection. A theoretical spin-glass model explains the crossover between the two regimes.
Contribution
The paper introduces a minimal statistical framework and a spin-glass-inspired generative model to explain the polynomial-exponential crossover in how attack success rate scales.
Findings
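Short injected prompts yield power-law (polynomial) scaling of attack success rate with the number of inference-time samples.
Long injected prompts yield exponential scaling of attack success rate with the number of inference-time samples.
The theoretical model agrees with empirical observations in large language models.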
Abstract
Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that strong adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We point out how this model naturally realizes the minimal…
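The two regimes named in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's model: it assumes each context has a per-sample probability p of a safe generation, that k samples are drawn independently, and that an attack succeeds if at least one sample is unsafe, so the attack success rate is 1 - E[p^k]. The Beta(a, b) distribution and the bound `p_max` are hypothetical choices made only for illustration. When the distribution of p has mass arbitrarily close to 1, E[p^k] decays like a power law in k; when p is bounded away from 1, it decays exponentially.

```python
import math


def all_safe_prob_beta(k, a, b):
    """E[p^k] when the per-context safe probability p ~ Beta(a, b).

    Illustrative assumption: a Beta density behaves like (1 - p)^(b - 1)
    near p = 1, which gives E[p^k] ~ k^(-b) for large k.
    """
    # E[p^k] = B(a + k, b) / B(a, b), evaluated with log-gamma for stability.
    log_val = (math.lgamma(a + k) + math.lgamma(a + b)
               - math.lgamma(a) - math.lgamma(a + b + k))
    return math.exp(log_val)


def all_safe_prob_bounded(k, p_max):
    """Probability that all k samples are safe when every context has p = p_max < 1."""
    return p_max ** k


def asr(all_safe_prob):
    """Attack success rate: at least one of the k samples is unsafe."""
    return 1.0 - all_safe_prob


def loglog_slope(f, k):
    """Slope of log f against log k, estimated between k and 2k."""
    return (math.log(f(2 * k)) - math.log(f(k))) / math.log(2.0)


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        heavy = asr(all_safe_prob_beta(k, a=2.0, b=0.5))
        bounded = asr(all_safe_prob_bounded(k, p_max=0.9))
        print(f"k={k:5d}  ASR (mass near p=1): {heavy:.4f}  "
              f"ASR (p bounded by 0.9): {bounded:.6f}")
```

In this toy setting, the log-log slope of the all-safe probability for the Beta case approaches -b, the signature of polynomial scaling, while for the bounded case the logarithm of the all-safe probability is linear in k, the signature of exponential scaling.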
