Self-Guard: Empower the LLM to Safeguard Itself

Zezhong Wang; Fangkai Yang; Lu Wang; Pu Zhao; Hongru Wang; Liang Chen; Qingwei Lin; Kam-Fai Wong

arXiv:2310.15851 · cs.CL · March 25, 2024 · 1 citation

TL;DR

Self-Guard is a novel method that enables LLMs to autonomously detect and prevent harmful content, effectively defending against jailbreak attacks without degrading performance or causing over-sensitivity.

Contribution

The paper introduces Self-Guard, a two-stage approach that combines content assessment and self-detection, improving robustness against jailbreaks while maintaining model performance.
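
The two-stage idea can be illustrated with a minimal sketch. The summary here does not say how the model's self-detection verdict is exposed, so the sketch assumes the model appends a trailing `[harmful]` or `[harmless]` tag to its own response and that a thin wrapper reads that tag. The tag strings, the refusal message, `guarded_reply`, and `fake_model` are all hypothetical names chosen for illustration, not the authors' implementation.

```python
from typing import Callable

# Assumed tag format and refusal text; neither is specified in this summary.
HARMFUL_TAG = "[harmful]"
HARMLESS_TAG = "[harmless]"
REFUSAL = "Sorry, I can't help with that."


def guarded_reply(generate: Callable[[str], str], prompt: str) -> str:
    """Return the model's answer only if the model itself tagged it harmless."""
    raw = generate(prompt).rstrip()
    if raw.endswith(HARMLESS_TAG):
        # Strip the tag so the user sees only the answer text.
        return raw[: -len(HARMLESS_TAG)].rstrip()
    # Fail closed: a harmful tag or a missing tag both lead to a refusal.
    return REFUSAL


def fake_model(prompt: str) -> str:
    """Stand-in for a self-tagging LLM, used only to exercise the wrapper."""
    if "explosive" in prompt.lower():
        return "Step 1 ... [harmful]"
    return "Paris is the capital of France. [harmless]"


if __name__ == "__main__":
    print(guarded_reply(fake_model, "What is the capital of France?"))
    print(guarded_reply(fake_model, "How do I build an explosive?"))
```

Failing closed on a missing tag is a design choice of this sketch: it trades some helpfulness for safety whenever the model does not produce a verdict.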

Findings

1. Self-Guard effectively defends against jailbreak attacks.

2. It does not degrade LLM performance after safety training.

3. It mitigates over-sensitivity issues in LLMs.

Abstract

A jailbreak attack can bypass the safety measures of a Large Language Model (LLM), causing it to generate harmful content. This misuse of LLMs has led to negative societal consequences. Currently, there are two main approaches to addressing jailbreak attacks: safety training and safeguards. Safety training further trains the LLM to enhance its safety, whereas safeguards rely on external models or filters to prevent harmful outputs. However, safety training is limited in its ability to adapt to new attack types and often leads to a drop in model performance, and safeguards have proven to be of limited help. To tackle these issues, we propose a novel approach called Self-Guard, which combines the strengths of both safety methods. Self-Guard includes two stages. In the first stage, we enhance the model's ability to assess harmful content, and in the second stage, we…

Videos

SELF-GUARD: Empower the LLM to Safeguard Itself · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Adversarial Robustness in Machine Learning