Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika

TL;DR
This paper introduces TAR, a method for embedding tamper-resistant safeguards into open-weight LLMs so that adversaries cannot easily remove them, even after hundreds of steps of fine-tuning.
Contribution
We propose TAR, a new approach that significantly improves the tamper-resistance of safeguards in open-weight LLMs, addressing vulnerabilities of existing methods.
Findings
TAR greatly enhances tamper-resistance of safeguards.
TAR preserves the benign capabilities of LLMs.
Extensive evaluations show TAR's robustness against fine-tuning attacks.
Abstract
Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps of fine-tuning. These vulnerabilities necessitate new approaches for enabling the safe release of open-weight LLMs. We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after hundreds of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that progress on…
Peer Reviews
Decision · ICLR 2025 Poster
1. **Significance** of the problem. The paper addresses an important and challenging problem: defending open-weight large language models against fine-tuning attacks. In the authors' words: "_This problem has been considered very challenging and by some intractable, as no method has yet provided substantial robustness to these attacks. However, making progress on this problem would provide a valuable tool to regulators and model developers by ameliorating the dual-use dilemma of open-weight models_".
1. **Evaluation** against out-of-distribution attacks. - My main concern is that the defense might be effective mostly against observed attacks and could break against unseen ones. For example, Table 4 in the Appendix shows that the "Retain $\rightarrow$ Forget" attack breaks the defense if it is not included in the training phase. Figure 4, and Figure 8 in the Appendix, show that PEFT attacks are more effective than Full Parameter attacks (in the case of Biosecurity, PEFT attacks break the propo
**About contribution**
+ The experimental results shown in Table 1 are significant enough to validate the main claims of this paper.
+ The proposed method is intuitive. Thanks to the detailed discussion of related work, it is not hard to understand why the authors designed the algorithms as presented, even for readers not familiar with LLM defenses.

**About novelty**
According to Section 2, this paper proposes the first defense method for autoregressive LLMs against tampering at
**About presentation**
+ The authors do not discuss the cost of the experiments, including time and GPU memory. Section B.4 mentions that the experiments use 8 A100 GPUs with 80GB of memory each. What is the minimum requirement for the experiments?
+ I suggest including a statement of contribution to make this paper easier to follow.
The security issues related to open LLMs are both important and intriguing. The authors present a series of solutions to address these security threats, and experiments validate the performance of the proposed mechanisms.
1. Insufficient sustainability. This paper proposes integrating adversarial learning and meta-learning to enhance the effectiveness of defense mechanisms, making it difficult for attackers to compromise them in a short period. However, this effectiveness depends on the diversity of attack types included in the training data used to optimize Eq. 1. In other words, the resilience of the proposed mechanism may be superficial and does not guarantee the security of open-weight LLMs. Fur
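
The combination of adversarial learning and meta-learning that this review describes can be pictured as a two-level loop: an inner loop simulates a fine-tuning attack, and an outer loop updates the weights so that the safeguard survives that attack while benign behaviour is retained. The following is a minimal toy sketch of that structure only. It is not the paper's algorithm or its Eq. 1, which is not reproduced on this page. The quadratic losses, the names `retain_target` and `forget_target`, the first-order gradient approximation, and all hyperparameters are assumptions chosen for illustration.

```python
# Toy illustration only: NOT the paper's algorithm or its Eq. 1.
# All names, losses, and hyperparameters here are hypothetical.


def loss(theta, target):
    """Quadratic stand-in loss: 0.5 * ||theta - target||^2."""
    return 0.5 * sum((t - x) ** 2 for t, x in zip(theta, target))


def grad(theta, target):
    """Gradient of the quadratic loss with respect to theta."""
    return [t - x for t, x in zip(theta, target)]


def simulate_attack(theta, forget_target, steps, attack_lr):
    """Inner loop: the adversary fine-tunes toward the forget target."""
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta, forget_target)
        theta = [t - attack_lr * gi for t, gi in zip(theta, g)]
    return theta


def defender_step(theta, retain_target, forget_target,
                  attack_steps=5, attack_lr=0.1, lam=0.5, lr=0.1):
    """Outer loop: one defender update.

    Descends the retain loss at theta and ascends the forget loss measured
    after the simulated attack. The attack-loss gradient is evaluated at the
    attacked weights and applied directly to theta (first-order approximation,
    an assumption of this sketch).
    """
    attacked = simulate_attack(theta, forget_target, attack_steps, attack_lr)
    g_retain = grad(theta, retain_target)
    g_tamper = grad(attacked, forget_target)
    return [t - lr * (gr - lam * gt)
            for t, gr, gt in zip(theta, g_retain, g_tamper)]


def train_defender(theta, retain_target, forget_target, outer_steps=200):
    """Repeat the defender update for a fixed number of outer steps."""
    for _ in range(outer_steps):
        theta = defender_step(theta, retain_target, forget_target)
    return theta


if __name__ == "__main__":
    retain_target, forget_target = [1.0, 0.0], [0.0, 1.0]
    undefended = list(retain_target)  # weights tuned only for benign behaviour
    defended = train_defender(undefended, retain_target, forget_target)
    for name, theta in [("undefended", undefended), ("defended", defended)]:
        attacked = simulate_attack(theta, forget_target, steps=5, attack_lr=0.1)
        print(name, "retain loss:", loss(theta, retain_target),
              "post-attack forget loss:", loss(attacked, forget_target))
```

In this toy, the defended weights end up with a higher forget loss after the same five-step attack, at the price of a somewhat higher retain loss. A long enough attack on a quadratic loss still succeeds, and the defender only ever sees the one attack simulated in `simulate_attack`, which is the kind of dependence on training-time attack coverage that the reviewers raise.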
