Scaling Trends in Language Model Robustness

Nikolaus Howe; Ian McKenzie; Oskar Hollinsworth; Michał Zajac; Tom Tseng; Aaron Tucker; Pierre-Luc Bacon; Adam Gleave

arXiv:2407.18213 · cs.LG · June 6, 2025

TL;DR

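This paper studies how language model robustness to adversarial attacks changes as model size, attack compute, and defense compute scale. Without explicit safety training, larger models are not consistently more robust, but scale makes adversarial training more sample-efficient at the cost of compute efficiency.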

Contribution

The paper presents an extensive empirical study of robustness scaling across several classification tasks, model families, and adversarial attacks, measuring how model size affects both attack success and the efficiency of adversarial training.

Findings

01. In the absence of explicit safety training, larger models are not consistently more robust.

02. Scale improves the sample efficiency of adversarial training but worsens its compute efficiency.

03. Increasing attack compute smoothly improves attack success rate against both undefended and defended models.

Abstract

Increasing model size has unlocked a dazzling array of capabilities in modern language models. At the same time, even frontier models remain vulnerable to jailbreaks and prompt injections, despite concerted efforts to make them robust. As both attack and defense gain access to more compute, and as models become larger, what happens to robustness? We argue that answering this question requires a scaling approach, which we employ in an extensive study of language model robustness across several classification tasks, model families, and adversarial attacks. We find that in the absence of explicit safety training, larger models are not consistently more robust; however, scale improves sample efficiency in adversarial training, though it worsens compute efficiency. Further, we find that increasing attack compute smoothly improves attack success rate against both undefended and…
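As a rough illustration of what a scaling approach to attack success can look like, the sketch below computes attack success rate at several attack compute budgets and fits a straight line to logit(success rate) against log10(compute). The outcome data are hypothetical placeholders, not results from the paper, and the logit-linear form, the function names, and the compute units are all assumptions made for this example rather than the authors' method.

```python
import math


def attack_success_rate(outcomes):
    """Fraction of attacked examples on which the attack succeeded."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(1 for succeeded in outcomes if succeeded) / len(outcomes)


def logit(p, eps=1e-6):
    """Log-odds of p, clamped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical placeholder data: attack compute budget (arbitrary units)
# mapped to per-example attack outcomes. Not taken from the paper.
outcomes_by_compute = {
    1e3: [False, False, False, True],
    1e4: [False, False, True, True],
    1e5: [False, True, True, True],
}

budgets = sorted(outcomes_by_compute)
xs = [math.log10(c) for c in budgets]
ys = [logit(attack_success_rate(outcomes_by_compute[c])) for c in budgets]
slope, intercept = fit_line(xs, ys)
print(f"logit(ASR) ~ {slope:.3f} * log10(compute) + {intercept:.3f}")
```

A positive slope in such a fit would correspond to attack success rising with attack compute; repeating the fit per model size or per defense setting is one way to compare trends across conditions.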

Code & Models

Repositories

AlignmentResearch/scaling-llm-robustness-paper (Official)

Videos

Scaling Trends in Language Model Robustness · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling