Scaling Trends in Language Model Robustness
Nikolaus Howe, Ian McKenzie, Oskar Hollinsworth, Michał Zajac, Tom Tseng, Aaron Tucker, Pierre-Luc Bacon, Adam Gleave

TL;DR
This paper investigates how scaling model size, attack compute, and defense compute affects the adversarial robustness of language models, revealing a complex relationship between model size, attack success, and adversarial training.
Contribution
The paper provides an extensive analysis of how robustness scales in language models, measuring the effect of model size on attack success and on the efficiency of adversarial training across several classification tasks, model families, and adversarial attacks.
Findings
In the absence of explicit safety training, larger models are not consistently more robust.
Scaling improves sample efficiency in adversarial training but reduces compute efficiency.
Increasing attack compute enhances attack success against both defended and undefended models.
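The findings above are stated in terms of attack success rate. As a minimal sketch of how such a metric could be computed for a classification task, the function below counts the fraction of initially correct predictions that an attack flips to incorrect. The function name, its arguments, and this particular definition are assumptions for illustration, not the paper's own evaluation code.

```python
from typing import Sequence


def attack_success_rate(
    clean_preds: Sequence[int],
    attacked_preds: Sequence[int],
    labels: Sequence[int],
) -> float:
    """Fraction of initially correct examples that become wrong under attack.

    Assumed definition for illustration: examples the model already
    misclassifies before the attack are excluded from the denominator.
    """
    if not (len(clean_preds) == len(attacked_preds) == len(labels)):
        raise ValueError("all inputs must have the same length")

    eligible = 0  # examples classified correctly before the attack
    flipped = 0   # eligible examples misclassified after the attack
    for clean, attacked, label in zip(clean_preds, attacked_preds, labels):
        if clean != label:
            continue
        eligible += 1
        if attacked != label:
            flipped += 1

    if eligible == 0:
        raise ValueError("no correctly classified examples to attack")
    return flipped / eligible
```

Evaluating this function at several attack budgets for models of different sizes would give the kind of curve the findings describe, with attack compute on one axis and success rate on the other.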
Abstract
Increasing model size has unlocked a dazzling array of capabilities in modern language models. At the same time, even frontier models remain vulnerable to jailbreaks and prompt injections, despite concerted efforts to make them robust. As both attack and defense gain access to more compute, and as models become larger, what happens to robustness? We argue that answering this question requires a scaling approach, which we employ in an extensive study of language model robustness across several classification tasks, model families, and adversarial attacks. We find that in the absence of explicit safety training, larger models are not consistently more robust; however, scale improves sample efficiency in adversarial training, though it worsens compute efficiency. Further, we find that increasing attack compute smoothly improves attack success rate against both undefended and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
