Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments
Samuel Nathanson, Rebecca Williams, Cynthia Matuszek

TL;DR
This paper investigates how the size difference between attacker and target large language models affects adversarial jailbreaking, finding that larger attackers are more effective and that size asymmetry affects robustness.
Contribution
It provides empirical evidence of scaling patterns in adversarial interactions among LLMs, highlighting the influence of size asymmetry on safety and robustness.
Findings
Larger attacker models are more effective at eliciting harmful responses.
A higher attacker-to-target size ratio correlates with greater harm severity.
Attacker refusal behavior strongly reduces harm.
Abstract
Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones - eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM families and scales (0.6B-120B parameters), measuring both harm score and refusal behavior as indicators of adversarial potency and alignment integrity. Each interaction is evaluated through aggregated harm and refusal scores assigned by three independent LLM judges, providing a consistent, model-based measure of adversarial outcomes. Aggregating results across prompts, we find a strong and statistically…
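The evaluation described in the abstract (per-exchange scores from three judges, then a relationship between model size and harm) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the mean as the aggregation rule, the 1-5 harm scale, the log10 size ratio, the Pearson coefficient, and every number in `exchanges` are assumptions made up for the example, and the helper names are hypothetical.

```python
import math
from statistics import mean


def aggregate_judges(scores):
    """Combine several judges' scores for one exchange (assumed: simple mean)."""
    return mean(scores)


def log_size_ratio(attacker_params_b, target_params_b):
    """log10 of attacker size over target size, both in billions of parameters."""
    return math.log10(attacker_params_b / target_params_b)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data, invented for illustration only (NOT results from the paper):
# (attacker size in B params, target size in B params, three judge harm scores)
exchanges = [
    (0.6, 120.0, [1, 1, 2]),
    (8.0, 8.0, [2, 3, 3]),
    (70.0, 8.0, [3, 4, 4]),
    (120.0, 0.6, [4, 5, 5]),
]

ratios = [log_size_ratio(a, t) for a, t, _ in exchanges]
harm = [aggregate_judges(scores) for _, _, scores in exchanges]
r = pearson(ratios, harm)
print(f"correlation between log size ratio and mean harm: {r:.3f}")
```

Taking the logarithm of the ratio makes the measure symmetric: an attacker ten times larger than its target and one ten times smaller sit at +1 and -1, which is convenient when model sizes span 0.6B to 120B parameters.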
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling
