TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering
Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Matthew Kowal, Nayeema Nonta, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla

TL;DR
TamperBench is a framework for systematically evaluating the tamper resistance of large language models against weight-space fine-tuning attacks and latent-space representation attacks, providing standardized metrics, reproducibility, and insights into defense effectiveness.
Contribution
It introduces the first unified platform for evaluating LLM tamper resistance, including attack repositories, hyperparameter sweeps, and safety-utility assessments, enabling consistent comparison across models.
Findings
Jailbreak-tuning is the most severe attack.
Post-training improves tamper resistance.
Triplet is a leading defense method.
Abstract
As increasingly capable open-weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modifications, whether accidental or intentional, becomes critical to minimize risks. However, there is no standard approach to evaluate tamper resistance. Varied data sets, metrics, and tampering configurations make it difficult to compare safety, utility, and robustness across different models and defenses. To this end, we introduce TamperBench, the first unified framework to systematically evaluate the tamper resistance of LLMs. TamperBench (i) curates a repository of state-of-the-art weight-space fine-tuning attacks and latent-space representation attacks; (ii) enables realistic adversarial evaluation through systematic hyperparameter sweeps per attack-model pair; and (iii) provides both safety and utility evaluations. TamperBench requires minimal…
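The sweep-then-select idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not TamperBench's actual API: the `SweepResult` record, its fields, the example scores, and the `min_utility` threshold are all hypothetical. It assumes each swept attack configuration yields one harmfulness score (for example a StrongREJECT-style score) and one utility score (for example an MMLU-style accuracy), and it reports the Pareto-optimal configuration with the highest harmfulness, which is how one reviewer describes the selection.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SweepResult:
    """One attack configuration from a sweep. All fields are hypothetical."""
    learning_rate: float
    epochs: int
    harmfulness: float  # higher = less safe (StrongREJECT-style score, assumed in [0, 1])
    utility: float      # higher = more capable (MMLU-style accuracy, assumed in [0, 1])


def pareto_front(results: List[SweepResult]) -> List[SweepResult]:
    """Keep results that no other result beats on both harmfulness and utility."""
    front = []
    for r in results:
        dominated = any(
            o.harmfulness >= r.harmfulness
            and o.utility >= r.utility
            and (o.harmfulness > r.harmfulness or o.utility > r.utility)
            for o in results
        )
        if not dominated:
            front.append(r)
    return front


def worst_case(results: List[SweepResult], min_utility: float = 0.0) -> SweepResult:
    """Pick the Pareto point with the highest harmfulness and utility >= min_utility."""
    candidates = [r for r in pareto_front(results) if r.utility >= min_utility]
    if not candidates:
        raise ValueError("no configuration meets the utility threshold")
    return max(candidates, key=lambda r: r.harmfulness)


# Made-up scores for one attack-model pair; not results from the paper.
results = [
    SweepResult(1e-5, 1, harmfulness=0.20, utility=0.62),
    SweepResult(1e-4, 1, harmfulness=0.55, utility=0.60),
    SweepResult(1e-4, 3, harmfulness=0.80, utility=0.45),
    SweepResult(1e-3, 3, harmfulness=0.85, utility=0.10),
    SweepResult(1e-5, 3, harmfulness=0.50, utility=0.58),  # dominated by (0.55, 0.60)
]

unconstrained = worst_case(results)
constrained = worst_case(results, min_utility=0.4)
print(unconstrained)
print(constrained)
```

With these made-up numbers, the unconstrained choice is the configuration whose utility has collapsed to 0.10, while a utility floor of 0.4 selects a slightly less harmful but still capable model. The floor is an illustrative design choice that addresses the concern about sharp capability drops raised in the reviews.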
Peer Reviews
Decision · Submitted to ICLR 2026
- Even though there are some key missing citations (see below), this paper is fairly extensive with its citations.
- The paper develops a standardized evaluation framework for an important class of attacks (representation-space and weight-space attacks), which differ from the input-space attacks most often considered in LLM robustness evaluations.
- Hyperparameter sweeps for the fine-tuning attacks are a crucial consideration (although see concern below about needing different optimizers). Too ma
Weaknesses:
- For the fine-tuning attack hyperparameters, it would have been good to consider different optimizers as well as different hyperparameters.
- I don't know if defending against weight-space and representation-space attacks should be grouped together as "tamper-resistance". Tamirisa et al. introduced the term tamper-resistance in the context of fine-tuning attacks, and it seems useful to maintain precision in terminology (even though some latent space attacks can be considered subsets
1. The evaluation of LLM safety and helpfulness is crucial, and a unified framework for evaluating them could save much of the effort of reproducing different environment settings.
2. This paper includes white-box, black-box, latent-space representation, and fine-tuning attacks, which covers a wide range.
3. The presentation is easy to read and follow.
1. Lack of novelty. This paper seems only to combine existing helpfulness and safety evaluation metrics, and there are no new metrics or benchmarks.
2. I think the main contribution comes from the re-organization of existing benchmarks. However, some examples in this paper, such as MMLU and StrongREJECT, have well-structured open-source code, making them easy to employ and test. I don't think re-organizing them saves much time compared with directly using their official code. Thi
- The benchmark consolidates attacks, defenses, and evaluation in a single unified framework, with a clear standardization of threat settings.
- The evaluation is very broad and covers 19 models and 9 attack strategies.
- Having multiple trials for non-embedding attacks is very helpful and greatly reduces the variance.
- For a “large-scale” benchmark, the included models appear relatively small. It would be helpful to have a discussion about scaling to bigger models.
- For some attacks, the data are very small (LoRA fine-tuning with 64 examples). These settings are fine, but could amplify variance.
- Picking Pareto points that maximize StrongREJECT is a conservative metric, but it can lead to a misinterpretation of the results when the capabilities drop sharply.
- Having other safety and capabilities metrics can h
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
