Merging Improves Self-Critique Against Jailbreak Attacks
Victor Gallego

TL;DR
This paper introduces a merging technique that enhances the self-critique ability of large language models, significantly reducing jailbreak attack success rates and improving robustness against adversarial prompts.
Contribution
It presents a novel approach that merges an external critic model with the original LLM to strengthen its self-critique and defend against jailbreak attacks.
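To make the idea of merging two models concrete, here is a minimal sketch of linear weight interpolation, one common way to merge models that share an architecture. It is an illustration only: the summary above does not state which merging scheme the paper uses, and the function name `linear_merge`, the `alpha` parameter, and the toy weights are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def linear_merge(base: Weights, critic: Weights, alpha: float = 0.5) -> Weights:
    """Return (1 - alpha) * base + alpha * critic, parameter by parameter.

    Assumption: both models have identical parameter names and shapes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != critic.keys():
        raise ValueError("models must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        critic_values = critic[name]
        if len(base_values) != len(critic_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * c
            for b, c in zip(base_values, critic_values)
        ]
    return merged


# Hypothetical toy weights, not taken from any real model.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
critic = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = linear_merge(base, critic, alpha=0.5)
```

With `alpha=0.0` the result equals the original model and with `alpha=1.0` it equals the critic, so `alpha` controls how much of the critic's behavior is blended in.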
Findings
Merging and self-critique reduce attack success rates significantly.
The approach improves LLM robustness against adversarial prompts.
Code and models are publicly available for replication.
Abstract
The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is done with the addition of an external critic model that can be merged with the original, thus bolstering self-critique capabilities and improving the robustness of the LLM's response to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can reduce the attack success rate of adversaries significantly, thus offering a promising defense mechanism against jailbreak attacks. Code, data, and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks.
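The self-critique step described in the abstract can be pictured as a draft, critique, revise loop. The sketch below is a rough illustration under stated assumptions, not the paper's implementation: the prompt templates, the function `respond_with_self_critique`, and the `generate` callable (a stand-in for any text-generation call) are hypothetical, and the actual prompts should be taken from the released repository.

```python
from typing import Callable, Optional

# Any function mapping an input text to a model completion.
Generate = Callable[[str], str]

# Hypothetical templates; the wording is an assumption for illustration.
CRITIQUE_TEMPLATE = (
    "Request:\n{prompt}\n\nDraft response:\n{draft}\n\n"
    "Identify anything harmful or unsafe in the draft response."
)
REVISE_TEMPLATE = (
    "Request:\n{prompt}\n\nDraft response:\n{draft}\n\n"
    "Critique:\n{critique}\n\n"
    "Rewrite the response so that it addresses the critique."
)


def respond_with_self_critique(
    prompt: str, generate: Generate, critic: Optional[Generate] = None
) -> str:
    """Draft a response, critique it, then revise it.

    If `critic` is None, the same model critiques its own draft, which is
    the situation after the critic has been merged into the original model.
    """
    critic = critic or generate
    draft = generate(prompt)
    critique = critic(CRITIQUE_TEMPLATE.format(prompt=prompt, draft=draft))
    return generate(
        REVISE_TEMPLATE.format(prompt=prompt, draft=draft, critique=critique)
    )


# Stub model so the sketch runs without any LLM dependency.
calls = []


def stub_model(text: str) -> str:
    calls.append(text)
    return f"output {len(calls)}"


final = respond_with_self_critique("example prompt", stub_model)
```

Passing a separate `critic` callable corresponds to using an external critic directly, while omitting it models the merged case where one set of weights plays both roles.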
Taxonomy
Topics: Information and Cyber Security · Cybercrime and Law Enforcement Studies · Advanced Malware Detection Techniques
