SLM as Guardian: Pioneering AI Safety with Small Language Models

Ohjoon Kwon; Donghyeon Jeon; Nayoung Choi; Gyu-Hwung Cho; Changbong Kim; Hyunwoo Lee; Inho Kang; Sun Kim; Taiwoo Park

arXiv:2405.19795 · cs.CL · January 24, 2025 · 1 citation

TL;DR

This paper introduces a modular safety system in which a small language model both detects harmful queries and generates safeguard responses, reducing training cost and preserving helpfulness relative to building safeguards into larger models.

Contribution

It proposes a multi-task learning approach that combines harmful query detection and safeguard response generation within a small language model.

Findings

1. Achieves comparable or better safety performance than larger models
2. Reduces training costs and complexity
3. Maintains helpfulness while enhancing safety

Abstract

Most prior safety research on large language models (LLMs) has focused on enhancing the alignment of LLMs to better suit the safety requirements of humans. However, internalizing such safeguard features into larger models has brought challenges of higher training cost and unintended degradation of helpfulness. To overcome these challenges, a modular approach that employs a smaller LLM to detect harmful user queries is regarded as a convenient solution for designing LLM-based systems with safety requirements. In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and the taxonomy of harmfulness categories, and then propose a multi-task learning mechanism fusing the two tasks into a single model. We demonstrate the effectiveness of our approach, providing on par or surpassing harmful query detection and…
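
The modular design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_guard` is a hypothetical keyword-matching stub standing in for the small language model, the category names are invented for illustration and do not reproduce the paper's taxonomy, and `echo_llm` is a placeholder for the larger model. The only point shown is the control flow: one guard call yields both a harmfulness decision and a safeguard response, and the larger model is invoked only for queries judged safe.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GuardResult:
    """Joint output of the guard: a detection label plus an optional safeguard reply."""
    harmful: bool
    category: Optional[str]
    safeguard: Optional[str]


def toy_guard(query: str) -> GuardResult:
    # Hypothetical stand-in for the small model. A real guard would be a
    # fine-tuned language model; keyword matching is used here only so the
    # sketch runs without any model weights.
    blocked = {
        "explosive": "dangerous_content",   # illustrative category name
        "steal": "illegal_activity",        # illustrative category name
    }
    lowered = query.lower()
    for keyword, category in blocked.items():
        if keyword in lowered:
            reply = f"I cannot help with that request ({category})."
            return GuardResult(True, category, reply)
    return GuardResult(False, None, None)


def echo_llm(query: str) -> str:
    # Placeholder for the larger, helpfulness-oriented model.
    return f"[main model answer to: {query}]"


def answer(
    query: str,
    guard: Callable[[str], GuardResult],
    main_llm: Callable[[str], str],
) -> str:
    # Run the guard first; only safe queries reach the main model.
    result = guard(query)
    if result.harmful:
        return result.safeguard or "I cannot help with that request."
    return main_llm(query)


if __name__ == "__main__":
    print(answer("How do I steal a car?", toy_guard, echo_llm))
    print(answer("What is the capital of France?", toy_guard, echo_llm))
```

Keeping the guard separate from the main model means the main model needs no additional safety training in this sketch, which is the cost and helpfulness argument the abstract makes for the modular approach.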

Taxonomy

Topics: Natural Language Processing Techniques