Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance
Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, Ruoxi Jia

TL;DR
This paper demonstrates that safety alignment in large language models can be restored using only a single safety example, achieving effective correction without utility loss and with minimal computational cost.
Contribution
The authors introduce a method to fully recover safety alignment in LLMs with just one safety sample, revealing the low-rank safety gradient structure.
Findings
Safety can be restored with one example regardless of model size.
Convergence occurs within a few training epochs.
Method validated across five safety-aligned LLMs and multiple datasets.
Abstract
Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to this belief, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of…
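The low-rank claim in the abstract can be illustrated with a small spectral check. The sketch below is not the authors' analysis code: it uses synthetic matrices as stand-ins for a per-layer safety gradient, and the helper names `spectral_energy` and `effective_rank` as well as the 0.9 energy threshold are assumptions made for illustration. It shows one common way to quantify "low rank": the share of squared Frobenius norm carried by the top singular values.

```python
import numpy as np


def spectral_energy(grad: np.ndarray, k: int) -> float:
    """Fraction of the squared Frobenius norm captured by the top-k singular values."""
    s = np.linalg.svd(grad, compute_uv=False)  # singular values, descending
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())


def effective_rank(grad: np.ndarray, threshold: float = 0.9) -> int:
    """Smallest number of singular directions whose energy share reaches `threshold`."""
    s = np.linalg.svd(grad, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # searchsorted returns the first index whose cumulative share is >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)


rng = np.random.default_rng(0)

# Hypothetical stand-in for a safety gradient: rank-2 signal plus tiny noise.
low_rank_grad = (
    rng.standard_normal((64, 2)) @ rng.standard_normal((2, 48))
    + 1e-3 * rng.standard_normal((64, 48))
)

# Hypothetical stand-in for an unstructured gradient: dense Gaussian matrix.
dense_grad = rng.standard_normal((64, 48))

print("top-2 energy (low-rank):", spectral_energy(low_rank_grad, 2))
print("effective rank (low-rank):", effective_rank(low_rank_grad))
print("effective rank (dense):", effective_rank(dense_grad))
```

With a real model, `grad` would be the gradient of the loss on a safety example with respect to one weight matrix. If only a few singular directions carry most of the energy, that is consistent with the abstract's explanation for why a small update can restore safety, though this toy check does not establish that by itself.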
Peer Reviews
Decision: ICLR 2026 Poster
Strengths: 1. The paper is overall clear and easy to follow. 2. The discovery of one-sample safety recovery provides insight into how LLM safety fine-tuning works.
Weaknesses: 1. Weak connections between observations, reasons and theories. Overall, I am not convinced after reading Section 4. Specifically, (1) I fail to see how Section 4.1 explains the main discovery that one sample is enough for safety recovery. Why does low-rankness of the gradients evaluated on a few safety samples indicate that only these few samples are needed for safety recovery? In addition, these experiments are conducted on Llama-2-chat models, while the main discovery i
- This paper demonstrates the counterintuitive finding that a single safety example suffices for full recovery, and validates it across multiple model families (five LLMs, including Llama, Mistral, Qwen, and GPT-4.1), diverse attack scenarios (harmful injection, identity shifting, backdoor poisoning), and varying scales (10–1000 harmful examples), with a consistent reduction in the attack success rate.
- The authors explain why rapid recovery is possible: an SVD analysis shows that safety gradie
Bi-level selection seems promising and coherent; however, whether the algorithm actually selects the best data remains unclear. Because it is evaluated on only one candidate pool, yielding a single selected example, its applicability to other settings is uncertain.
1. The finding that a single sample can restore safety performance is very interesting. 2. The results are validated across various models and datasets. 3. The authors also provide theoretical explanations for why a single sample works, which makes the paper more solid. 4. The paper is well-structured and clearly written.
1. My main concern is how to choose such a sample, since the method appears to hinge on a carefully selected safety example. Is it found through extensive trial and error? 2. The bilevel optimization-based approach proposed in the paper does not seem to be reflected in the experiments. 3. I am also curious how sensitive performance is (across different models and datasets) to the choice of this sample.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Safety Systems Engineering in Autonomy
