HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

Seanie Lee; Haebin Seong; Dong Bok Lee; Minki Kang; Xiaoyin Chen; Dominik Wagner; Yoshua Bengio; Juho Lee; Sung Ju Hwang

arXiv:2410.01524·cs.CL·February 25, 2025

TL;DR

HarmAug is a data augmentation technique that improves the training of smaller safety guard models: it jailbreaks an LLM and prompts it to generate diverse harmful instructions, enabling smaller models to match larger ones in detecting malicious queries.

Contribution

The paper introduces HarmAug, a data augmentation method that enhances knowledge distillation for safety guard models by generating diverse harmful instructions, so that a smaller model with lower computational cost can stand in for a large one.

Findings

01 HarmAug outperforms baseline augmentation methods.

02 A 435M-parameter model trained with HarmAug matches larger models in F1 score.

03 A HarmAug-trained model achieves higher AUPRC at less than 25% of the larger model's computational cost.

Abstract

Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions.…
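
The distillation step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a common distillation objective that mixes binary cross-entropy against the hard harmfulness label with a KL term pulling the student toward the teacher's predicted probability. The function names and the mixing weight `lam` are hypothetical and do not come from the page above.

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw student logit to a probability of 'harmful'."""
    return 1.0 / (1.0 + math.exp(-z))


def _clamp(p: float, eps: float = 1e-12) -> float:
    """Keep probabilities away from 0 and 1 so the logs stay finite."""
    return min(max(p, eps), 1.0 - eps)


def bce(label: float, prob: float) -> float:
    """Binary cross-entropy between a hard label (0 or 1) and a probability."""
    prob = _clamp(prob)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def bernoulli_kl(p: float, q: float) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)): teacher p, student q."""
    p, q = _clamp(p), _clamp(q)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def distillation_loss(student_logit: float, teacher_prob: float,
                      label: float, lam: float = 0.5) -> float:
    """Assumed objective: (1 - lam) * hard-label BCE + lam * KL to the teacher.

    `lam` is a hypothetical mixing weight chosen for illustration.
    """
    q = sigmoid(student_logit)
    return (1.0 - lam) * bce(label, q) + lam * bernoulli_kl(teacher_prob, q)


if __name__ == "__main__":
    # A student that agrees with the teacher incurs a lower loss
    # than one that confidently disagrees.
    print(distillation_loss(3.0, teacher_prob=0.95, label=1.0))
    print(distillation_loss(-3.0, teacher_prob=0.95, label=1.0))
```

In this sketch, augmented instructions such as those HarmAug generates would simply be extra training examples whose `teacher_prob` comes from the large guard model; how the paper assigns the hard label to synthetic examples is not stated in the truncated abstract, so that detail is left open here.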

Code & Models

Repositories

imnotkind/HarmAug (PyTorch, official)

Models

hbseong/HarmAug-Guard (Hugging Face model · 114 downloads · 40 likes)

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research · Risk and Safety Analysis