Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

Samuel Nathanson; Rebecca Williams; Cynthia Matuszek

arXiv:2511.13788·cs.LG·January 5, 2026

TL;DR

This paper examines how the relative size of attacker and target LLMs affects jailbreak outcomes in multi-turn exchanges. Larger attackers elicit harmful responses more effectively, and the size asymmetry between attacker and target affects robustness.

Contribution

The paper provides empirical evidence of scaling patterns in adversarial interactions among LLMs, showing how attacker-target size asymmetry influences safety and robustness.

Findings

01 Larger attacker models are more effective at eliciting harmful responses.

02 The attacker-to-target size ratio correlates with increased harm severity.

03 Attacker refusal behavior strongly reduces harm.

Abstract

Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones, eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM families and scales (0.6B-120B parameters), measuring both harm score and refusal behavior as indicators of adversarial potency and alignment integrity. Each interaction is evaluated through aggregated harm and refusal scores assigned by three independent LLM judges, providing a consistent, model-based measure of adversarial outcomes. Aggregating results across prompts, we find a strong and statistically…
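The measurement pipeline described in the abstract (three judge scores per exchange, aggregated and related to model size) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the use of a simple mean to aggregate judges, the log10 attacker-to-target size ratio, the Pearson coefficient, the 1-5 harm scale, and every number in `exchanges` are assumptions made up for the example and are not taken from the paper.

```python
import math
from statistics import mean

def aggregate_judges(scores):
    """Combine per-judge harm scores for one exchange (assumed: simple mean)."""
    return mean(scores)

def log_size_ratio(attacker_params_b, target_params_b):
    """log10 of attacker size over target size, both in billions of parameters."""
    return math.log10(attacker_params_b / target_params_b)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical records, invented for illustration (NOT data from the paper):
# (attacker size in B, target size in B, harm scores from three judges)
exchanges = [
    (0.6, 120.0, (1, 1, 2)),
    (8.0, 8.0, (2, 3, 3)),
    (70.0, 8.0, (3, 4, 4)),
    (120.0, 0.6, (4, 5, 5)),
]

ratios = [log_size_ratio(a, t) for a, t, _ in exchanges]
harms = [aggregate_judges(scores) for _, _, scores in exchanges]
r = pearson(ratios, harms)
print(f"correlation between log size ratio and mean harm: {r:.3f}")
```

Working with the logarithm of the ratio keeps a 200x attacker advantage and a 200x target advantage symmetric around zero, which is one reasonable choice when sizes span 0.6B to 120B parameters; the paper may use a different transformation or statistic.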

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling