Making Harmful Behaviors Unlearnable for Large Language Models
Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR
This paper introduces a novel framework using security vectors to prevent large language models from learning harmful behaviors during fine-tuning, enabling safer customization without sacrificing useful knowledge.
Contribution
The paper proposes a controllable training method with security vectors that makes harmful behaviors unlearnable, enhancing safety in LLM fine-tuning.
Findings
Security vectors generated from only 100 harmful samples are enough to prevent the LLM from learning from 1,000 harmful samples.
The method preserves the LLM's ability to learn useful information.
Security vectors can be deactivated during inference to restore normal behavior.
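The first and third findings can be illustrated with a toy sketch. It is not the paper's implementation. It assumes that security vectors work by making the model's outputs already match the harmful data, so the fine-tuning loss yields almost no gradient for the base weights. The one-parameter linear "model", the names `fit_security_vector` and `fine_tune`, and all data values are hypothetical.

```python
# Toy illustration only: a one-parameter "model" pred = (w + s) * x,
# where w is the base weight and s is a separable security vector.
# All names and numbers here are hypothetical, not taken from the paper.

def mse_grad(effective_weight, xs, ys):
    """Gradient of the mean squared error w.r.t. the effective weight."""
    return sum(2 * (effective_weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def fit_security_vector(w, xs, ys, lr=0.1, steps=200):
    """Train only s (base weight w frozen) so that w + s fits the harmful data."""
    s = 0.0
    for _ in range(steps):
        s -= lr * mse_grad(w + s, xs, ys)
    return s

def fine_tune(w, xs, ys, s=0.0, lr=0.1, steps=200):
    """Train only w; s stays fixed (s=0.0 means security vector deactivated)."""
    for _ in range(steps):
        w -= lr * mse_grad(w + s, xs, ys)
    return w

BASE_W = 1.0  # stands in for the aligned model's behavior

# A small "harmful" set used to build the security vector (target slope 3.0).
small_x = [0.5, 1.0, 1.5]
small_y = [3.0 * x for x in small_x]

# A larger "harmful" fine-tuning set with the same target behavior.
large_x = [0.1 * i for i in range(1, 21)]
large_y = [3.0 * x for x in large_x]

security_vector = fit_security_vector(BASE_W, small_x, small_y)

# With the vector active, the loss is already near zero, so w barely moves.
w_protected = fine_tune(BASE_W, large_x, large_y, s=security_vector)

# Without it, the base weight drifts to the harmful target.
w_unprotected = fine_tune(BASE_W, large_x, large_y)

# At inference the vector is dropped: predictions use w_protected alone.
print(f"protected w:   {w_protected:.4f}")
print(f"unprotected w: {w_unprotected:.4f}")
```

In this toy, the protected base weight stays near its original value while the unprotected one moves to the harmful target. The sketch does not model the second finding: a single scalar cannot show that learning from benign data is preserved.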
Abstract
Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains. To meet the requirements of different applications, LLMs are often customized by further fine-tuning. However, the powerful learning ability of LLMs not only enables them to acquire new tasks but also makes them susceptible to learning undesired behaviors. For example, even safety-aligned LLMs can be easily fine-tuned into harmful assistants, as the fine-tuning data often contains implicit or explicit harmful content. Can we train LLMs on harmful data without learning harmful behaviors? This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process. Specifically, we introduce "security vectors", a few new parameters that can be separated from the LLM, to ensure the LLM's responses are consistent with the harmful…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
