CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer
Yue Zhao, Yujia Gong, Ruigang Liang, Shenchen Zhu, Kai Chen, Xuejing Yuan, Wangjun Zhang

TL;DR
This paper introduces Cross-Model Neuron Transfer (CNT), a post-hoc method that reuses safety-related functionality across large language models by transferring a minimal subset of neurons from a donor model to a target model, with minimal performance loss.
Contribution
CNT enables modular safety functionality transfer between LLMs at the neuron level, supporting both addition and deletion of safety features in a post-hoc manner.
Findings
Achieves safety functionality transfer with less than 1% performance degradation.
Outperforms five baseline methods across multiple safety tasks.
Demonstrates generality and effectiveness in diverse LLMs.
Abstract
The widespread deployment of large language models (LLMs) calls for post-hoc methods that can flexibly adapt models to evolving safety requirements. Meanwhile, the rapidly expanding open-source LLM ecosystem has produced a diverse collection of models that already exhibit various safety-related functionalities. This motivates a shift from constructing safety functionality from scratch to reusing existing functionality from external models, thereby avoiding costly data collection and training procedures. In this paper, we present Cross-Model Neuron Transfer (CNT), a post-hoc method that reuses safety-oriented functionality by transferring a minimal subset of neurons from an open-source donor LLM to a target LLM. By operating at the neuron level, CNT enables modular function-level adaptation, supporting both function addition and function deletion. We evaluate CNT on seven popular LLMs…
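As a rough illustration of what "transferring a subset of neurons" can mean at the weight level, the sketch below copies selected hidden units of a single feed-forward block from a donor to a target. It is not the paper's procedure: the abstract excerpt does not say how CNT selects neurons or what it requires of the two architectures, so this assumes both blocks have identical shapes and that the neuron indices are already given. The names `transfer_neurons`, `make_block`, `w_in`, `b_in`, and `w_out` are hypothetical.

```python
import numpy as np


def transfer_neurons(target, donor, neuron_ids):
    """Copy the listed hidden neurons of one MLP block from donor to target.

    Assumption (not from the paper): each block is a dict with
    'w_in' of shape (hidden, d_model), 'b_in' of shape (hidden,), and
    'w_out' of shape (d_model, hidden), and both blocks share these shapes.
    Returns a new dict; neither input is modified.
    """
    patched = {name: weights.copy() for name, weights in target.items()}
    idx = np.asarray(neuron_ids, dtype=int)
    # A hidden neuron owns one row of w_in, one bias entry, and one column of w_out.
    patched["w_in"][idx, :] = donor["w_in"][idx, :]
    patched["b_in"][idx] = donor["b_in"][idx]
    patched["w_out"][:, idx] = donor["w_out"][:, idx]
    return patched


def make_block(seed, d_model=4, hidden=8):
    """Build a toy MLP block with random weights (illustration only)."""
    rng = np.random.default_rng(seed)
    return {
        "w_in": rng.normal(size=(hidden, d_model)),
        "b_in": rng.normal(size=hidden),
        "w_out": rng.normal(size=(d_model, hidden)),
    }


target = make_block(seed=0)
donor = make_block(seed=1)
patched = transfer_neurons(target, donor, neuron_ids=[2, 5])
```

In this toy setting, only neurons 2 and 5 of `patched` come from the donor, and every other weight is unchanged from the target. Under the same assumptions, the reverse direction (restoring those neurons from a copy without the behavior) would be the analogue of function deletion.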
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
