On Evaluating the Durability of Safeguards for Open-Weight LLMs
Xiangyu Qi, Boyi Wei, Nicholas Carlini, Yangsibo Huang, Tinghao Xie, Luxi He, Matthew Jagielski, Milad Nasr, Prateek Mittal, Peter Henderson

TL;DR
This paper critically examines the effectiveness of current safeguards for open-weight LLMs, revealing evaluation challenges and emphasizing the need for more rigorous threat models to accurately assess safeguard durability.
Contribution
It highlights the difficulties in evaluating LLM safeguards and advocates for more precise, constrained threat models for better assessment accuracy.
Findings
Evaluating LLM safeguards is more challenging than it appears.
Current assessments may overestimate safeguard durability.
Rigorous, well-defined threat models are essential for meaningful evaluation.
Abstract
Stakeholders -- from model developers to policymakers -- seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are fully open. In response, several recent studies have proposed methods to produce durable LLM safeguards for open-weight LLMs that can withstand adversarial modifications of the model's weights via fine-tuning. This holds the promise of raising adversaries' costs even under strong threat models where adversaries can directly fine-tune model weights. However, in this paper, we urge for more careful characterization of the limits of these approaches. Through several case studies, we demonstrate that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into…
Peer Reviews
Decision: ICLR 2025 Poster
1. Relevance to Security: The paper addresses an important and timely issue (LLM security in open-weight contexts) by examining whether current safeguards are genuinely robust.
2. Empirical Rigor: The study is thorough, covering multiple aspects of how small changes in setup, configuration, and prompting can influence the success of fine-tuning attacks.
3. Practical Implications: The paper offers valuable insights for researchers and policymakers aiming to secure LLMs, highlighting the limitations
1. Lack of Alternative Solutions: While the paper critiques existing methods, it lacks exploration of alternative approaches or improvements that could enhance durability.
2. Heavy Reliance on Specific Models and Datasets: The findings are heavily based on specific models and datasets (e.g., LLaMA-2, WMDP), potentially limiting generalizability.
3. High Computational Costs: The methodology requires considerable computational resources for testing different random seeds, prompts, and configuration
S1: The paper is clearly motivated with regard to the importance of designing durable safeguards that ensure concepts are unlearned under adversarial settings.
S2: Replication studies are of high importance. Properly validating proposed methods arguably has higher value for the community than proposing yet another method.
S3: The authors are detail-oriented in validating their replication study. For instance, they compare their exact replication with the original paper.
W1: There are various ambiguities in this work that would benefit from more precise language and more technical detail.
- W1a: This work uses ambiguous language for technical concepts at times. Formalising some concepts using math could help to greatly disambiguate concepts that are vague in text. For instance, L136 "RepNoise trains a model to push its representations of HarmfulQA data points at each layer toward random noise." For readers not familiar with prior work, this statemen
- The paper is clear and well written, and has a thorough analysis of two unlearning methods and their weaknesses. The authors try very simple modifications that seem effective in breaking the safeguards. This is important because it suggests that unlearning methods must be evaluated very carefully.
- The authors include a discussion that explains how difficult the evaluation problem is, especially given the huge number of attacks available.
- It’s standard practice to evaluate models on multiple-choice datasets by checking the highest logits of the letters corresponding to the answer options. So evaluating on a full answer using humans and LLM judges sounds like a bit of an unfair comparison. I agree that evaluating the logits instead of the full answer might be insufficient to properly compute the performance, but this sounds more like a problem of the standard evaluation setting than the problem itself. Probably a more
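To make the logit-based multiple-choice scoring mentioned in the last review concrete, here is a minimal sketch in Python. It assumes the next-token logits for the option letters have already been extracted from a model; the function names and the example logit values are hypothetical and are not taken from the paper or from any particular evaluation harness.

```python
import math


def predict_from_letter_logits(letter_logits):
    """Return the option letter whose logit is highest."""
    return max(letter_logits, key=letter_logits.get)


def letter_probabilities(letter_logits):
    """Softmax restricted to the answer letters (numerically stable)."""
    top = max(letter_logits.values())
    exps = {k: math.exp(v - top) for k, v in letter_logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def logit_accuracy(examples):
    """Accuracy over (letter_logits, gold_letter) pairs."""
    correct = sum(
        predict_from_letter_logits(logits) == gold for logits, gold in examples
    )
    return correct / len(examples)


# Hypothetical next-token logits for the letters A-D on two questions.
examples = [
    ({"A": 1.2, "B": 3.4, "C": 0.1, "D": -0.5}, "B"),  # argmax is B: correct
    ({"A": 2.0, "B": 1.9, "C": 0.3, "D": 0.0}, "B"),   # argmax is A: wrong
]

print(logit_accuracy(examples))  # 0.5
```

This scoring only compares the logits of the option letters, so it never inspects what the model would write in a free-form answer. That gap is the one the reviewer is pointing at when contrasting it with judging full answers by humans or LLM judges.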
Taxonomy
Topics: Nuclear and radioactivity studies
