Capability-Based Scaling Trends for LLM-Based Red-Teaming
Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping

TL;DR
This paper investigates how the effectiveness of red-teaming attacks on large language models depends on the capability gap between attacker and target, revealing key trends and proposing a predictive scaling curve.
Contribution
It introduces a capability gap framework for red-teaming, analyzes over 600 attacker-target pairs, and derives a jailbreaking scaling curve to predict attack success based on model capabilities.
Findings
More capable models are stronger attackers.
Attack success drops sharply when target exceeds attacker in capability.
Attack success correlates with high performance on social science benchmarks.
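The second finding can be made concrete with a small sketch. The summary does not state the functional form of the jailbreaking scaling curve, so the code below assumes one for illustration: the capability gap is the difference of logit-transformed benchmark accuracies, and attack success rate (ASR) follows a sigmoid in that gap. The function names and the toy data points are hypothetical and are not taken from the paper.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def capability_gap(attacker_score, target_score):
    """Assumed definition: attacker minus target benchmark accuracy, in logit space."""
    return logit(attacker_score) - logit(target_score)


def fit_logit_line(gaps, asrs):
    """Least-squares fit of logit(ASR) against the gap; returns (slope, intercept)."""
    ys = [logit(a) for a in asrs]
    n = len(gaps)
    mean_x = sum(gaps) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in gaps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gaps, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_asr(gap, slope, intercept):
    """Predicted ASR for a given capability gap under the assumed sigmoid form."""
    return sigmoid(slope * gap + intercept)


# Toy (attacker accuracy, target accuracy, observed ASR) triples, invented for illustration.
pairs = [
    (0.30, 0.70, 0.05),
    (0.45, 0.60, 0.20),
    (0.55, 0.55, 0.45),
    (0.65, 0.45, 0.70),
    (0.75, 0.35, 0.90),
]
gaps = [capability_gap(a, t) for a, t, _ in pairs]
slope, intercept = fit_logit_line(gaps, [asr for _, _, asr in pairs])
```

Under this assumed form, a positive fitted slope means predicted ASR falls as the target pulls ahead of the attacker, which is the qualitative pattern the findings describe; calling `predict_asr(capability_gap(0.4, 0.8), slope, intercept)` would give a forecast for a new, weaker-attacker pair.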
Abstract
As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 600 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these…
Peer Reviews
Decision·ICLR 2026 Poster
- The analysis is built upon "more than 600 attacker-target combinations", providing a robust empirical basis for its conclusions. The heatmap in Figure 2 visualizes this extensive dataset, lending significant weight to the observed trends and the derived scaling law.
- By fine-tuning models to remove their safety guardrails, they "eliminate the attacker's refusal as a confounding factor". This allows the study to focus on a model's raw ability to craft adversarial prompts, rather than its will
- The paper's central narrative is built on the foundational assumption that general academic benchmark performance (MMLU-Pro) is a valid proxy for the highly specific skills of both offensive jailbreaking and robust defense. This assumption is questionable and introduces a potential circularity. The authors first use MMLU-Pro to define the "capability gap" axis and then, in Section 6.1, show that ASR correlates strongly with the social-science splits of MMLU-Pro (Figure 6). While this correlati
- I really like the research direction and the work presented in the paper is extensive and comprehensive
- The experiments and approach are innovative with interesting insights which spark many potential ideas for future followups
- **Scaling Law Misdirection**: My main point of critique is that the paper is trying to turn things into a scaling law that really shouldn’t be one. In a way, the notion of there being “scaling laws” is detracting from a lot of the valuable insights this paper is providing. Many curves seem very forced (e.g., in Fig 4) and just drawn through a cloud. I am pretty sure even a linear fit would achieve a similar R^2. There are many parts where correlations are presented as causations, and as discussed
Pros
- Semi-comprehensive testing to try to figure out if there's a scaling law for jailbreaking
- Useful for forecasting
Cons
- It doesn't seem to highlight the limitation that "jailbreaks" might evolve or change forms (imo); namely, we might care about different jailbreaks or different sensitive topics in the future.
- I couldn't find if there was a manual QA somewhere. The setup is that an LLM judges whether an LLM attacker can jailbreak the defender, but I'm unsure of false positives or false negatives in this setup.
- The "what makes a good attacker" might be heavily mis-representing the "truth" as models scale they might tend to do b
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Information and Cyber Security · Topic Modeling
