Decoupled Alignment for Robust Plug-and-Play Adaptation

Haozheng Luo; Jiahao Yu; Wenxin Zhang; Jialong Li; Jerry Yao-Chieh Hu; Xinyu Xing; Han Liu

arXiv:2406.01514·cs.CL·June 7, 2024·1 cites

Open Access · 4 Reviews

TL;DR

This paper presents a low-resource, plug-and-play method for aligning large language models using knowledge distillation, improving safety without extensive fine-tuning or reinforcement learning.

Contribution

It introduces a novel knowledge distillation approach that extracts alignment information from well-aligned models and applies it to unaligned models efficiently.

Findings

01

Increases the average defense success rate by approximately 14.41% on a harmful-question dataset.

02

Achieves up to 51.39% success rate without degrading model performance.

03

Applicable to 17 different unaligned pre-trained LLMs.

Abstract

We introduce a low-resource safety enhancement method for aligning large language models (LLMs) without the need for supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Our main idea is to exploit knowledge distillation to extract the alignment information from existing well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play fashion. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On the harmful question dataset, our method significantly enhances the average defense success rate by approximately 14.41%, reaching as high as 51.39%, in 17 unaligned pre-trained LLMs, without compromising performance.
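The delta-debugging step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic ddmin-style reduction over a list of layer indices, and the names `minimal_layer_subset` and `toy_oracle` are hypothetical. In a real setting the oracle would presumably transplant the candidate layers from the aligned model into the unaligned one and check whether refusal behavior is preserved; here it is replaced by a toy predicate so the example runs on its own.

```python
from typing import Callable, List, Sequence


def minimal_layer_subset(
    layers: Sequence[int],
    still_aligned: Callable[[List[int]], bool],
) -> List[int]:
    """ddmin-style search: shrink `layers` while `still_aligned` keeps holding.

    `still_aligned` is a hypothetical oracle supplied by the caller.
    """
    current = list(layers)
    assert still_aligned(current), "the full set must satisfy the oracle"
    n = 2  # granularity: roughly how many chunks to split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        # Step 1: does a single chunk suffice on its own?
        for subset in subsets:
            if still_aligned(subset):
                current, n, reduced = subset, 2, True
                break
        # Step 2: otherwise, can one chunk be dropped?
        if not reduced:
            for subset in subsets:
                complement = [x for x in current if x not in subset]
                if complement and still_aligned(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # Step 3: otherwise refine the granularity, or stop at single items.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


def toy_oracle(candidate: List[int]) -> bool:
    # Assumption for illustration only: layers 3 and 7 are the critical ones.
    return {3, 7}.issubset(candidate)


if __name__ == "__main__":
    print(minimal_layer_subset(range(12), toy_oracle))  # prints [3, 7]
```

Each oracle call in a real setting would involve model inference, so the number of calls made by the search is the main cost to watch.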

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

Introduces training-free, plug-and-play alignment via targeted memory transplantation, a novel framing that removes retraining cost and offers a new perspective on alignment localization in LLMs. Demonstrates strong empirical rigor. It is tested on 17 models across Llama, Gemma, and Mistral families, with consistent DSR gains and detailed ablations confirming that MLP gate layers encode alignment knowledge. Presents a transparent workflow (delta-debugging search + layer transplantation), clear

Weaknesses

DAPA explicitly relies on a pre-aligned teacher and cannot align a model from scratch. This raises questions about its motivation — if obtaining a safe teacher model still requires costly alignment methods such as RLHF or DPO, the overall cost advantage may diminish. Moreover, identifying alignment-critical MLP components through delta-debugging also incurs inference cost, yet the paper does not quantify it. It would be helpful to report the actual computational cost of the search phase and comp

Reviewer 02 · Rating 4 · Confidence 4

Strengths

**S1: Novel use of memory editing for alignment-localization.** The paper takes the “knowledge editing / memory editing” line of work and repurposes it for safety rather than for factual editing: they experimentally probe hidden states and MLP submodules to locate the neurons / projections that carry alignment information, and then transplant only those parts to the unaligned model. Using memory editing to attribute and transfer alignment-related neurons is, in my view, a fresh angle compared wi

Weaknesses

**W1: Questionable motivation / unclear problem setting.** The paper assumes a scenario where “well-aligned” LLMs are already available in the same model family, yet the proposed solution is still to distill and transfer alignment knowledge from the well-aligned model to a “shallowly aligned” or “shadow-aligned” one. This raises a basic motivation question: if a robustly aligned model is already accessible, why is distillation the preferable path instead of using the aligned model directly? The

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. A wide array of models were considered (3 families of LLMs).
2. The evaluation criteria is satisfactorily defined. The experimentation for both refusal and performance was done on multiple datasets for the sake of generalizability.
3. This paper serves as a finding paper in identifying the underlying correlation between a refusal behaviour and core components of the LLM. I am basing my score of weak acceptance based on the experimentation towards this finding. Though I still have reservation

Weaknesses

1. My major concern is the practicality of DAPA in practical LLM alignment. For instance, editing the unaligned model still requires the presence of an aligned model (llama 7B chat for llama 7B base etc). The presence of such an aligned model defeats the purpose of DAPA for alignment unless that alignment had caused a significant degradation in certain other aspects. If that is the case those instances should be studied for the validation of the method as a define. For instance bas

Reviewer 04 · Rating 2 · Confidence 4

Strengths

The paper is well-structured, and the experimental evaluation encompasses a comprehensive range of model series.

Weaknesses

* **Limited Contribution.** The overall contribution of the paper appears limited. The proposed *layer-wise replacement* strategy is relatively coarse-grained, whereas recent model-editing approaches (e.g., [1, 2]) enable fine-grained control over specific knowledge or behavioral components within LLMs. These related works are neither discussed nor compared, leaving the novelty and significance of the proposed method insufficiently demonstrated.
* **Unclear Methodological Description.** The

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Vision and Imaging · Human Pose and Action Recognition

Methods: Knowledge Distillation