Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks

Seokil Ham; Yubin Choi; Yujin Yang; Seungju Cho; Younghun Kim; Changick Kim

arXiv:2506.07356·cs.CL·October 14, 2025

Open Access 4 Reviews

TL;DR

This paper introduces a Refusal-Teacher-guided finetuning method that improves safety and task performance of LLMs under harmful finetuning attacks by filtering harmful prompts and distilling safety knowledge.

Contribution

It proposes a novel finetuning framework that directly guides base models with a safety-aware Ref-Teacher, outperforming safety-aligned weights in safety and downstream task accuracy.

Findings

1. Reduces harmful outputs effectively.
2. Enhances finetuning accuracy on user tasks.
3. Provides a practical approach for safe LLM deployment.

Abstract

Recently, major AI providers such as Google and OpenAI have introduced Finetuning-as-a-Service (FaaS), which allows users to customize Large Language Models (LLMs) using their own data. However, this service is vulnerable to safety degradation when user data includes harmful prompts, a threat known as harmful finetuning attacks. Prior works attempt to mitigate this issue by first constructing a safety-aligned model and then finetuning that model on user data. However, we observe that the safety-aligned weights provide a weak initialization for downstream task learning, leading to suboptimal safety alignment and downstream task performance. To address this, we propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework. Instead of finetuning a safety-aligned model on user data, our approach directly finetunes the base model under the guidance of a safety-aligned Ref-Teacher, which…
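The abstract is truncated here, so the exact training objective is not visible on this page. As a rough illustration only, the two ingredients named in the TL;DR (filtering harmful prompts and distilling from a safety-aligned teacher) could be sketched as below. Every name in this sketch (`filter_user_data`, `harm_score`, `distillation_loss`, `guided_loss`, `alpha`, the threshold value) is hypothetical, and the loss shown is a generic cross-entropy plus KL distillation term, not the paper's confirmed formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_user_data(examples, harm_score, threshold=0.5):
    """Keep examples whose harm score is below the threshold.

    `harm_score` is a hypothetical callable (for example, a teacher-based
    classifier) returning a value in [0, 1]; it is an assumption of this sketch.
    """
    return [ex for ex in examples if harm_score(ex) < threshold]


def distillation_loss(student_logits, teacher_logits, eps=1e-12):
    """KL(teacher || student) over next-token distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def task_loss(student_logits, target_index, eps=1e-12):
    """Cross-entropy of the student against the user-task label."""
    return -math.log(softmax(student_logits)[target_index] + eps)


def guided_loss(student_logits, teacher_logits, target_index, alpha=1.0):
    """Assumed combined objective: task loss plus weighted distillation."""
    return task_loss(student_logits, target_index) + alpha * distillation_loss(
        student_logits, teacher_logits
    )
```

In this sketch the distillation term vanishes when the student matches the teacher, so `alpha` only trades off task fit against staying close to the teacher's behavior; how the paper actually balances these terms is not stated in the visible text.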

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper tackles a relevant, current and pervasive problem.
2. The experimental section shows clear improvements over state-of-the-art.

Weaknesses

1. The solution seems costly, and it is not clear whether the benefits justify the cost for LLMs, where training is already prohibitively expensive.
2. The solution is non-intuitive and more complex to implement than other state-of-the-art approaches.
3. It requires changing the data itself, raising concerns about distributional shifts and making the generalizability questionable.
4. The depth of the novelty is not clear, and quite a few of the observations such as training on both safety a

Reviewer 02 · Rating 2 · Confidence 4

Strengths

* The problem of fine-tuning degrading safety alignment is practically relevant and has recently received widespread attention.
* The idea of using signals from a safety-aligned teacher model is interesting.

Weaknesses

* Quite a large number of solutions have recently been proposed for mitigating safety degradation after fine-tuning. Besides alignment-stage defenses (baselines in the paper, such as Vaccine and Booster, fall under this), there are fine-tuning-stage defenses (e.g., SafeInstruct [Bianchi et al., 2024], VLGuard [Zong et al., 2024], constrained-SFT [Qi et al., 2024]), and post-fine-tuning defenses (e.g., SafeLoRA [Hsu et al., 2025], RESTA [Bharadwaj et al., 2024], SOMF [Yi et al., 2024], Antidote

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- The paper is well-written and easy to follow. The motivation is laid out logically, building a clear case for why a new approach is needed.
- The proposed solution, to finetune the base model directly while carefully managing the safety/utility trade-off, is an effective response to this finding. This work provides a practical approach to a problem in Finetuning-as-a-service.

Weaknesses

- The data filtering strategy is configured to maximize recall on harmful prompts, which may discard some harmless user data (a high false positive rate). It is unclear what percentage of harmless data is filtered out across different tasks.
- The authors should experiment with other data filtering methods [1][2] for a more comprehensive comparison.
- [3][4] also study this problem from the similarity perspective, which should also be discussed in the revision.

[1] Deep ignorance: Filteri

Reviewer 04 · Rating 6 · Confidence 4

Strengths

1. The paper's motivation is clearly shown through the experiments in Section 4, which effectively frame the problem the proposed method aims to solve.
2. The experimental evaluation is comprehensive, covering a diverse range of datasets and settings.
3. The paper is well-written and easy to follow.

Weaknesses

1. My main concern is the fairness of the comparison. The proposed method uses a data filtering step that the baselines lack, and this filter appears optimized for the evaluation tasks. Although Appendix C1 includes a related comparison with LLaMAGuard3-8B, a more direct evaluation applying the same trained data filter to the baseline methods is needed.
2. The method adds several components that likely increase computational cost and deployment complexity. The paper would benefit from a clear an

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)

Methods: Balanced Selection