HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang

TL;DR
HarmAug is a data augmentation technique for training smaller safety guard models: it jailbreaks an LLM and prompts it to generate diverse harmful instructions, which lets a small distilled model match larger ones at detecting malicious queries.
Contribution
The paper introduces HarmAug, a novel data augmentation method that enhances knowledge distillation for safety guard models by generating diverse harmful instructions, reducing model size and computational cost.
Findings
HarmAug outperforms baseline augmentation methods.
A 435M-parameter model trained with HarmAug matches larger models in F1 score.
The HarmAug-trained model achieves a higher AUPRC at less than 25% of the larger model's computational cost.
Abstract
Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions.…
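The distillation setup described in the abstract can be sketched as a per-example loss that mixes the binary harmfulness label with the teacher's predicted probability. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mixing weight `lam`, and the choice of a Bernoulli KL term are all hypothetical and are not given in the text above.

```python
import math

EPS = 1e-7  # clipping constant to keep logarithms finite (assumed value)


def _clip(p):
    """Clip a probability into the open interval (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def binary_cross_entropy(label, student_prob):
    """Cross-entropy between a binary label (0 or 1) and the student's
    predicted probability that the input is harmful."""
    q = _clip(student_prob)
    return -(label * math.log(q) + (1 - label) * math.log(1.0 - q))


def bernoulli_kl(teacher_prob, student_prob):
    """KL(Bernoulli(teacher_prob) || Bernoulli(student_prob))."""
    p = _clip(teacher_prob)
    q = _clip(student_prob)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def distillation_loss(label, teacher_prob, student_prob, lam=0.5):
    """Hypothetical distillation objective: a weighted sum of the
    hard-label loss and the divergence from the teacher's soft prediction.
    `lam` (assumed) controls how much the student follows the teacher."""
    hard = binary_cross_entropy(label, student_prob)
    soft = bernoulli_kl(teacher_prob, student_prob)
    return (1.0 - lam) * hard + lam * soft


if __name__ == "__main__":
    # A harmful example (label 1) where the teacher is confident (0.9).
    print(distillation_loss(1, 0.9, 0.5))  # student undecided
    print(distillation_loss(1, 0.9, 0.9))  # student agrees with teacher
```

In this sketch the loss shrinks as the student's probability moves toward both the label and the teacher's prediction; in a real training loop the same quantity would be averaged over a batch and minimized with a gradient-based optimizer.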
Taxonomy
Topics: Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research · Risk and Safety Analysis
