Capability-Based Scaling Trends for LLM-Based Red-Teaming

Alexander Panfilov; Paul Kassianik; Maksym Andriushchenko; Jonas Geiping

arXiv:2505.20162 · cs.AI · February 10, 2026

TL;DR

This paper investigates how the effectiveness of red-teaming attacks on large language models depends on the capability gap between attacker and target, revealing key trends and proposing a predictive scaling curve.

Contribution

It introduces a capability gap framework for red-teaming, analyzes over 600 attacker-target pairs, and derives a jailbreaking scaling curve to predict attack success based on model capabilities.
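
To make the idea of a scaling curve over a capability gap concrete, here is a minimal sketch. It is not the authors' implementation: the gap definition (a difference of logit-transformed benchmark accuracies), the logistic form of the curve, the helper names, and all numbers are assumptions chosen for illustration. The data is synthetic and not taken from the paper.

```python
import math

def logit(p):
    """Map an accuracy in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def capability_gap(attacker_score, target_score):
    """Hypothetical gap: attacker minus target, in logit space (an assumption)."""
    return logit(attacker_score) - logit(target_score)

def fit_logistic(gaps, rates, lr=0.5, steps=5000):
    """Fit rate ~ sigmoid(k * gap + b) by gradient descent on cross-entropy."""
    k, b = 0.0, 0.0
    n = len(gaps)
    for _ in range(steps):
        grad_k = grad_b = 0.0
        for x, y in zip(gaps, rates):
            err = sigmoid(k * x + b) - y  # gradient of cross-entropy w.r.t. the logit
            grad_k += err * x
            grad_b += err
        k -= lr * grad_k / n
        b -= lr * grad_b / n
    return k, b

# Synthetic attack success rates generated from a known curve (illustrative only).
TRUE_K, TRUE_B = 1.5, -0.5
gaps = [i * 0.5 for i in range(-4, 5)]
rates = [sigmoid(TRUE_K * g + TRUE_B) for g in gaps]

k_hat, b_hat = fit_logistic(gaps, rates)

# Predicted success for a weaker attacker against a stronger target, and the reverse.
weak_to_strong = sigmoid(k_hat * capability_gap(0.40, 0.70) + b_hat)
strong_to_weak = sigmoid(k_hat * capability_gap(0.70, 0.40) + b_hat)
print(f"k={k_hat:.3f}, b={b_hat:.3f}")
print(f"weak-to-strong={weak_to_strong:.3f}, strong-to-weak={strong_to_weak:.3f}")
```

Under these assumptions, a negative gap (target stronger than attacker) yields a lower predicted success rate than a positive gap of the same size, which matches the qualitative trend the paper reports.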

Findings

01 More capable models are stronger attackers.

02 Attack success drops sharply when the target exceeds the attacker in capability.

03 Attack success correlates with high performance on social science benchmarks.

Abstract

As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 600 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The analysis is built upon "more than 600 attacker-target combinations", providing a robust empirical basis for its conclusions. The heatmap in Figure 2 visualizes this extensive dataset, lending significant weight to the observed trends and the derived scaling law.
- By fine-tuning models to remove their safety guardrails, they "eliminate the attacker's refusal as a confounding factor". This allows the study to focus on a model's raw ability to craft adversarial prompts, rather than its will

Weaknesses

- The paper's central narrative is built on the foundational assumption that general academic benchmark performance (MMLU-Pro) is a valid proxy for the highly specific skills of both offensive jailbreaking and robust defense. This assumption is questionable and introduces a potential circularity. The authors first use MMLU-Pro to define the "capability gap" axis and then, in Section 6.1, show that ASR correlates strongly with the social-science splits of MMLU-Pro (Figure 6). While this correlati

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- I really like the research direction, and the work presented in the paper is extensive and comprehensive.
- The experiments and approach are innovative, with interesting insights that spark many potential ideas for future follow-ups.

Weaknesses

- **Scaling Law Misdirection**: My main point of critique is that the paper is trying to turn things into a scaling law that really shouldn't be one. In a way, the notion of there being "scaling laws" detracts from a lot of the valuable insights this paper provides. Many curves seem very forced (e.g., in Fig 4) and just drawn through a cloud. I am pretty sure even a linear fit would achieve a similar R^2. There are many parts where correlations are presented as causations, and as discussed

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Semi-comprehensive testing to try to figure out if there's a scaling law for jailbreaking
- Useful for forecasting

Weaknesses

- It doesn't seem to highlight the limitation that "jailbreaks" might evolve or change form (imo); namely, we might care about different jailbreaks or different sensitive topics in the future.
- I couldn't find whether there was a manual QA somewhere. The setup is that an LLM judges whether an LLM attacker can jailbreak the defender, but I'm unsure of the false positives or false negatives in this setup.
- The "what makes a good attacker" might be heavily mis-representing the "truth" as models scale they might tend to do b

Code & Models

Repositories

kotekjedi/capability-based-scaling
pytorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Information and Cyber Security · Topic Modeling