Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation
Ziyang Ma, Qingyue Yuan, Linhai Zhang, Deyu Zhou

TL;DR
This paper introduces SLowED, a safe distillation method for small language models (SLMs) that preserves both safety and reasoning ability by applying Slow Tuning and Low-Entropy Masking during training.
Contribution
The paper proposes SLowED, combining Slow Tuning and Low-Entropy Masking to enhance safety and reasoning in SLMs without extra data or computation.
Findings
SLowED maintains SLM safety during training.
SLowED improves reasoning capabilities to a degree comparable with existing methods.
Ablation studies show that Slow Tuning preserves safety in the early training epochs, while Low-Entropy Masking extends the number of epochs over which training remains safe.
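As a rough illustration of what masking low-entropy tokens out of a training loss could look like, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`log_softmax`, `token_entropy`, `masked_cross_entropy`), the `entropy_threshold` parameter, and the choice to compute entropy from the model's own next-token distribution are all assumptions made for illustration, since this summary does not specify those details.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_entropy(logits):
    """Shannon entropy (in nats) of the predicted distribution at each position.

    logits has shape (seq_len, vocab_size).
    """
    log_probs = log_softmax(logits)
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

def masked_cross_entropy(logits, targets, entropy_threshold):
    """Mean cross-entropy over positions whose entropy is at or above a threshold.

    Positions with entropy below `entropy_threshold` (an assumed, hypothetical
    hyperparameter) are excluded from the loss. Returns (loss, keep_mask).
    """
    log_probs = log_softmax(logits)
    nll = -log_probs[np.arange(len(targets)), targets]
    keep = token_entropy(logits) >= entropy_threshold
    if not keep.any():
        # Every position was masked, so nothing contributes to the loss.
        return 0.0, keep
    return float(nll[keep].mean()), keep

# Toy example: position 0 is confidently predicted (low entropy),
# position 1 is uniform over three tokens (high entropy).
logits = np.array([[10.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
targets = np.array([0, 1])
loss, keep = masked_cross_entropy(logits, targets, entropy_threshold=0.5)
```

In this toy case only the second position contributes to the loss. In a real training loop the same mask would typically be applied to the per-token loss of a deep learning framework rather than to NumPy arrays.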
Abstract
Previous chain-of-thought (CoT) distillation methods have primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects that this training has on SLM safety, which are revealed in this study. Although there are works on safety alignment that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data, and may impair the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. Slow Tuning…
