Bypassing the Safety Training of Open-Source LLMs with Priming Attacks

Jason Vega; Isha Chaudhary; Changming Xu; Gagandeep Singh

arXiv:2312.12321 · cs.CR · May 20, 2024 · 1 citation

TL;DR

This paper demonstrates that simple, optimization-free priming attacks can bypass the safety training of open-source large language models, raising the success rate of eliciting harmful behavior by up to 3.3 times over baselines.

Contribution

It introduces priming attacks as a new, simple method to bypass safety alignment in open-source LLMs, showing their effectiveness without optimization.

Findings

01. Priming attacks increase the Attack Success Rate on Harmful Behaviors, as measured by Llama Guard, by up to 3.3 times compared to baselines.

02. Simple, optimization-free attacks can bypass safety training.

03. Source code and data are publicly available.

Abstract

With the recent surge in popularity of LLMs has come an ever-increasing need for LLM safety training. In this paper, we investigate the fragility of SOTA open-source LLMs under simple, optimization-free attacks we refer to as "priming attacks", which are easy to execute and effectively bypass alignment from safety training. Our proposed attack improves the Attack Success Rate on Harmful Behaviors, as measured by Llama Guard, by up to $3.3\times$ compared to baselines. Source code and data are available at https://github.com/uiuc-focal-lab/llm-priming-attacks.
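To make the headline metric concrete, here is a minimal sketch of how an Attack Success Rate and an improvement factor such as the reported "up to $3.3\times$" could be computed from per-prompt judge verdicts. It assumes each response has already been labeled safe or unsafe by a judge such as Llama Guard. The function names and the toy labels are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Fraction of judged responses flagged unsafe (True = attack succeeded)."""
    if not unsafe_flags:
        raise ValueError("need at least one judged response")
    return sum(unsafe_flags) / len(unsafe_flags)


def improvement_factor(attack_asr: float, baseline_asr: float) -> float:
    """Ratio of the attack's success rate to the baseline's success rate."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive to form a ratio")
    return attack_asr / baseline_asr


# Toy verdicts for illustration only; these are NOT results from the paper.
baseline_flags = [True, False, False, False]
attack_flags = [True, True, True, False]

baseline_asr = attack_success_rate(baseline_flags)
attack_asr = attack_success_rate(attack_flags)
factor = improvement_factor(attack_asr, baseline_asr)
print(f"baseline={baseline_asr:.2f} attack={attack_asr:.2f} factor={factor:.1f}x")
```

A ratio of this kind depends heavily on the baseline: the same absolute success rate gives a larger factor against a weaker baseline, so the factor is best read alongside the underlying rates.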

Code & Models

Repositories

uiuc-focal-lab/llm-priming-attacks (PyTorch, official)

Taxonomy

Topics: Advanced Malware Detection Techniques · Web Application Security Vulnerabilities · Digital and Cyber Forensics