Merging Improves Self-Critique Against Jailbreak Attacks

Victor Gallego

arXiv:2406.07188·cs.CL·July 16, 2024

TL;DR

This paper introduces a merging technique that enhances the self-critique ability of large language models, significantly reducing jailbreak attack success rates and improving robustness against adversarial prompts.

Contribution

It presents an approach in which an external critic model is merged with the original LLM to strengthen the model's self-critique and defend against jailbreak attacks.

Findings

01

Merging and self-critique reduce attack success rates significantly.

02

The approach improves LLM robustness against adversarial prompts.

03

Code and models are publicly available for replication.
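
The attack success rate named in the first finding is conventionally the fraction of adversarial prompts whose response is judged harmful. The sketch below shows that calculation only. The function name and the `is_harmful` judge are hypothetical placeholders, not the paper's evaluation code, and the responses in the usage note are made-up strings rather than results.

```python
from typing import Callable, Iterable


def attack_success_rate(
    responses: Iterable[str],
    is_harmful: Callable[[str], bool],
) -> float:
    """Fraction of responses to adversarial prompts that a judge flags as harmful.

    `is_harmful` is a hypothetical judge (a classifier, a rule, or a human label);
    it is an assumption of this sketch, not something specified by the source.
    """
    collected = list(responses)
    if not collected:
        raise ValueError("at least one response is required")
    flagged = sum(1 for response in collected if is_harmful(response))
    return flagged / len(collected)
```

For instance, if a judge flags two of four responses, `attack_success_rate` returns 0.5; a defense is effective to the extent that it pushes this number toward 0.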

Abstract

The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is done with the addition of an external critic model that can be merged with the original, thus bolstering self-critique capabilities and improving the robustness of the LLM's response to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can reduce the attack success rate of adversaries significantly, thus offering a promising defense mechanism against jailbreak attacks. Code, data, and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks.
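
The two ingredients the abstract combines, merging a critic model into the original and having the model critique its own response, can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the abstract does not specify the merging method, so plain linear interpolation of parameters is assumed here, and `linear_merge`, `self_critique`, `alpha`, and the prompt wording are all hypothetical. Weights are shown as plain lists of floats to keep the example self-contained.

```python
from typing import Callable, Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def linear_merge(base: Weights, critic: Weights, alpha: float = 0.5) -> Weights:
    """Interpolate two models with identical parameter names and sizes.

    ASSUMPTION: linear interpolation is used for illustration only; the source
    abstract does not state which merging method the paper uses.
    alpha = 0 returns the base model, alpha = 1 returns the critic.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != critic.keys():
        raise ValueError("models must share the same parameter names")
    merged: Weights = {}
    for name, base_values in base.items():
        critic_values = critic[name]
        if len(base_values) != len(critic_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * c
            for b, c in zip(base_values, critic_values)
        ]
    return merged


def self_critique(generate: Callable[[str], str], prompt: str) -> str:
    """Draft a response, critique it, then revise it, using one model throughout.

    `generate` is a hypothetical text-in, text-out wrapper around the merged
    model; the prompt templates below are illustrative, not the paper's.
    """
    draft = generate(prompt)
    critique = generate(
        "Identify anything harmful or unsafe in this response:\n" + draft
    )
    return generate(
        "Rewrite the response so that it addresses the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
```

In this sketch the merged weights would back the `generate` callable, so the same model both answers and critiques; the intuition in the abstract is that the merged-in critic makes the critique step more reliable on adversarial prompts.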

Code & Models

Repositories

vicgalle/merging-self-critique-jailbreaks (official)

Taxonomy

Topics: Information and Cyber Security · Cybercrime and Law Enforcement Studies · Advanced Malware Detection Techniques