Trading Inference-Time Compute for Adversarial Robustness

Wojciech Zaremba; Evgenia Nitishinskaya; Boaz Barak; Stephanie Lin; Sam Toyer; Yaodong Yu; Rachel Dias; Eric Wallace; Kai Xiao; Johannes Heidecke; Amelia Glaese

arXiv:2501.18841·cs.LG·February 3, 2025

Open Access

TL;DR

Giving reasoning models more inference-time compute makes them more robust to adversarial attacks: attack success rates generally fall as compute grows. This points to a promising way of improving LLM reliability without adversarial training.

Contribution

The paper shows experimentally that, across a variety of attacks, increasing inference-time compute improves the adversarial robustness of reasoning models (OpenAI o1-preview and o1-mini) without any adversarial training.

Findings

01. More inference-time compute reduces attack success rates.

02. In many cases, attack success probability approaches zero with increased compute.

03. Some scenarios show no improvement, indicating limits of this approach.

Abstract

We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve…
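The quantity the abstract tracks, the fraction of model samples on which an attack succeeds at a given level of inference-time compute, can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the function name `attack_success_rate`, the `(compute_level, attack_succeeded)` record format, and the toy records are hypothetical and are not taken from the paper or its data.

```python
from collections import defaultdict


def attack_success_rate(samples):
    """Return {compute_level: fraction of samples where the attack succeeded}.

    `samples` is an iterable of (compute_level, attack_succeeded) pairs.
    This record format is a hypothetical choice for the sketch.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for compute_level, attack_succeeded in samples:
        totals[compute_level] += 1
        successes[compute_level] += int(attack_succeeded)
    # Sort by compute level so the trend can be read top to bottom.
    return {level: successes[level] / totals[level] for level in sorted(totals)}


# Made-up placeholder records, NOT results from the paper.
toy_samples = [
    (1, True), (1, True), (1, False), (1, True),
    (4, True), (4, False), (4, False), (4, False),
    (16, False), (16, False), (16, False), (16, False),
]

rates = attack_success_rate(toy_samples)
for level, rate in rates.items():
    print(f"compute={level:>2}  attack success rate={rate:.2f}")
```

In an actual evaluation, each record would come from sampling the model on an adversarial prompt at a chosen reasoning budget and grading whether the attacker's goal was met; how the budget is set and how success is graded are left open here.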

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Fault Detection and Control Systems