Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
Heegyu Kim, Sehyun Yuk, Hyunsouk Cho

TL;DR
This paper introduces a training-free self-refinement method with formatting to enhance the safety of language models against jailbreak attacks, outperforming existing defenses and reducing computational costs.
Contribution
The paper presents a formatting-based self-refinement approach that improves language model safety without additional training. With it, even non-safety-aligned LMs outperform safety-aligned LMs on safety tasks.
Findings
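Self-refine with formatting is the safest training-free defense against jailbreak attacks among the baselines evaluated.
Formatting makes self-refinement more efficient, lowering attack success rates in fewer iterations.
Non-safety-aligned LMs outperform safety-aligned LMs on safety tasks by giving more helpful and safe responses.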
Abstract
Caution: This paper includes offensive words that could potentially cause unpleasantness. Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment is extensive and makes it hard to respond immediately to fast-developing attacks such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs, and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. Additionally, we propose a formatting method that improves the efficiency of the self-refine process while reducing attack success rates in fewer iterations. We also observe that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses. In conclusion, our findings can achieve less safety…
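The abstract describes the method only at a high level, so the following is a minimal sketch of what an iterative self-refinement loop with formatting could look like, not the authors' implementation. The function names (`format_exchange`, `self_refine`), the delimiter format, the instruction wording, and the `generate` and `is_harmful` callables are all hypothetical stand-ins for a language model call and a safety check.

```python
from typing import Callable, Tuple


def format_exchange(prompt: str, response: str) -> str:
    """Wrap the prompt and response in labelled delimiters.

    The tag format is an assumption made for illustration; the summary
    does not specify the format the paper uses.
    """
    return (
        f"<prompt>\n{prompt}\n</prompt>\n"
        f"<response>\n{response}\n</response>"
    )


def self_refine(
    prompt: str,
    generate: Callable[[str], str],     # hypothetical LM call: text in, text out
    is_harmful: Callable[[str], bool],  # hypothetical safety check on a response
    max_iters: int = 3,
) -> Tuple[str, int]:
    """Return the final response and the number of refinement rounds used."""
    response = generate(prompt)
    for iteration in range(max_iters):
        if not is_harmful(response):
            return response, iteration
        exchange = format_exchange(prompt, response)
        # Step 1: ask the model to critique its own response.
        feedback = generate(
            "Explain why the response below may be harmful.\n" + exchange
        )
        # Step 2: ask the model to rewrite the response using that critique.
        response = generate(
            "Rewrite the response so that it is safe.\n"
            + exchange
            + "\nFeedback:\n"
            + feedback
        )
    # The last rewrite is returned without a further check.
    return response, max_iters
```

In this sketch the loop stops as soon as the check passes, so the returned round count is the quantity that "fewer iterations" would refer to. No model training is involved; only repeated calls to the same model.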
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Physical Unclonable Functions (PUFs) and Hardware Security · Advanced Malware Detection Techniques
