Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
Jie Li, Yi Liu, Chongyang Liu, Xiaoning Ren, Ling Shi, Weisong Sun, Yinxing Xue

TL;DR
This paper explores self and cross-model distillation techniques to improve the refusal capabilities of large language models, effectively reducing unsafe outputs and enhancing security against toxic prompts.
Contribution
It introduces novel self-distilling and cross-model distilling methods for LLM alignment, demonstrating significant improvements in refusal rates and safety.
Findings
Models with uniform refusal patterns are more secure.
Cross-model distillation achieves refusal rates close to Claude3's 94.51%.
Distillation methods significantly reduce unsafe content.
Abstract
Large Language Models (LLMs) like OpenAI's GPT series, Anthropic's Claude, and Meta's LLaMa have shown remarkable capabilities in text generation. However, their susceptibility to toxic prompts presents significant security challenges. This paper investigates alignment techniques, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), to mitigate these risks. We conduct an empirical study on refusal patterns across nine LLMs, revealing that models with uniform refusal patterns, such as Claude3, exhibit higher security. Based on these findings, we propose self-distilling and cross-model distilling methods to enhance LLM security. Our results show that these methods significantly improve refusal rates and reduce unsafe content, with cross-model distilling achieving refusal rates close to Claude3's 94.51%. These findings underscore the potential of…
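The abstract relies on two measurable quantities: the refusal rate of a model and how uniform its refusal patterns are. The sketch below shows one way such quantities could be computed from a list of model responses. It is not the authors' evaluation procedure. The marker phrases, the function names, and the "dominant pattern share" uniformity measure are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence

# Hypothetical refusal markers, chosen for illustration only.
# The paper's actual refusal-detection criteria are not given in this summary.
REFUSAL_MARKERS: Sequence[str] = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "i am unable",
)


def first_marker(response: str, markers: Sequence[str] = REFUSAL_MARKERS) -> Optional[str]:
    """Return the first marker (in list order) found in the response, or None."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for marker in markers:
        if marker in text:
            return marker
    return None


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses that contain at least one refusal marker."""
    items = list(responses)
    if not items:
        raise ValueError("responses must not be empty")
    refused = sum(first_marker(r) is not None for r in items)
    return refused / len(items)


def dominant_pattern_share(responses: Iterable[str]) -> float:
    """Share of refusals that use the single most common marker.

    An assumed proxy for "uniform refusal patterns": a value near 1.0 means
    almost all refusals are phrased the same way.
    """
    matched = [m for m in (first_marker(r) for r in responses) if m is not None]
    if not matched:
        return 0.0
    most_common_count = Counter(matched).most_common(1)[0][1]
    return most_common_count / len(matched)


if __name__ == "__main__":
    sample = [
        "I'm sorry, but I can't help with that.",
        "Sure, here is an overview of the topic.",
        "I cannot assist with this request.",
        "I'm sorry, I can't do that.",
    ]
    print(f"refusal rate: {refusal_rate(sample):.2%}")
    print(f"dominant pattern share: {dominant_pattern_share(sample):.2%}")
```

Keyword matching of this kind is only a rough proxy. It misses refusals phrased in other ways and can flag compliant answers that happen to contain a marker, so a real evaluation would likely need a stronger classifier or manual review.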
Taxonomy
Topics: Machine Learning and Data Classification · Fuzzy Logic and Control Systems · Machine Learning and Algorithms
Methods: Attention Is All You Need · Cosine Annealing · Byte Pair Encoding · Attention Dropout · Weight Decay · Dropout · Adam · Linear Warmup With Cosine Annealing · Linear Layer · Dense Connections
