Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

Heegyu Kim; Sehyun Yuk; Hyunsouk Cho

arXiv:2402.15180·cs.LG·February 28, 2024·3 cites

Open Access

TL;DR

This paper introduces a training-free self-refinement method with formatting to enhance the safety of language models against jailbreak attacks, outperforming existing defenses and reducing computational costs.

Contribution

It presents a novel formatting-based self-refinement approach that improves safety in language models without additional training, outperforming safety-aligned models in safety tasks.

Findings

01. Self-refine with formatting achieves superior safety performance.
02. The method reduces attack success rates in fewer iterations.
03. Non-safety-aligned LMs outperform safety-aligned LMs in safety tasks.

Abstract

Caution: This paper includes offensive words that some readers may find unpleasant. Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment is resource-intensive, which makes it hard to respond immediately to fast-developing attacks such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs, and we evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. Additionally, we propose a formatting method that improves the efficiency of the self-refine process while reducing attack success rates in fewer iterations. We also observe that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses. In conclusion, our findings can achieve less safety…
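The abstract names the technique but does not spell out the procedure, so the following is only a minimal sketch of a generic self-refine loop with a formatting step, not the authors' implementation. It assumes that "formatting" means wrapping the prompt and response in a structured container (JSON here), and that the caller supplies two hypothetical callbacks: `generate` (a call to any LM) and `is_harmful` (any safety check). The prompt wording, the function names, and the `max_iters` default are all illustrative assumptions.

```python
import json
from typing import Callable, Tuple


def format_pair(prompt: str, response: str) -> str:
    # Assumed formatting step: present the exchange as JSON so the model
    # treats it as data to review rather than as instructions to follow.
    return json.dumps({"prompt": prompt, "response": response}, indent=2)


def self_refine(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical LM call
    is_harmful: Callable[[str, str], bool],  # hypothetical safety check
    max_iters: int = 4,                      # illustrative default
) -> Tuple[str, int]:
    """Return the final response and the number of refinement rounds used."""
    response = generate(prompt)
    for used in range(max_iters):
        if not is_harmful(prompt, response):
            return response, used
        block = format_pair(prompt, response)
        # Round part 1: ask the model to critique its own response.
        feedback = generate(
            "Explain what is unsafe about the response in this JSON:\n" + block
        )
        # Round part 2: ask the model to rewrite the response using the critique.
        response = generate(
            "Rewrite the response so that it is safe and still helpful.\n"
            + block
            + "\nFeedback: "
            + feedback
        )
    # The iteration budget is exhausted; the last rewrite is returned unchecked.
    return response, max_iters
```

No training is involved in this loop: each round costs two extra LM calls, which is why reaching a safe response in fewer iterations translates directly into lower computational cost.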

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Physical Unclonable Functions (PUFs) and Hardware Security · Advanced Malware Detection Techniques
