Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

Zora Che; Stephen Casper; Robert Kirk; Anirudh Satheesh; Stewart Slocum; Lev E McKinney; Rohit Gandikota; Aidan Ewart; Domenic Rosati; Zichu Wu; Zikui Cai; Bilal Chughtai; Yarin Gal; Furong Huang; Dylan Hadfield-Menell

arXiv:2502.05209 · cs.CR · July 28, 2025

Open Access

TL;DR

This paper introduces model tampering attacks, which modify a model's latent activations or weights, as a way to evaluate LLM risks and capabilities more thoroughly than input-output testing alone. The results show how difficult it is to suppress harmful behaviors robustly.

Contribution

The paper proposes using model tampering attacks for more rigorous LLM evaluation and demonstrates their effectiveness and limitations relative to existing input-space methods.

Findings

01 Model resilience to attacks lies in a low-dimensional robustness subspace.

02 Tampering attack success predicts input-space attack success.

03 Unlearning harmful capabilities can be undone with minimal fine-tuning.

Abstract

Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, this approach suffers from two limitations. First, input-output evaluations cannot fully evaluate realistic risks from open-weight models. Second, the behaviors identified during any particular input-output evaluation can only lower-bound the model's worst-possible-case input-output behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to…

Taxonomy

Topics: Security and Verification in Computing · Advanced Malware Detection Techniques · Network Security and Intrusion Detection