Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh

TL;DR
This paper demonstrates that simple priming attacks can effectively bypass safety training in open-source large language models, significantly increasing the success rate of harmful behavior elicitation.
Contribution
The paper introduces priming attacks, a simple, optimization-free method for bypassing safety alignment in open-source LLMs, and shows that they are effective without any optimization step.
Findings
Priming attacks increase attack success rate by up to 3.3 times.
Simple, optimization-free attacks can bypass safety training.
Source code and data are publicly available.
Abstract
With the recent surge in popularity of LLMs has come an ever-increasing need for LLM safety training. In this paper, we investigate the fragility of SOTA open-source LLMs under simple, optimization-free attacks we refer to as priming attacks, which are easy to execute and effectively bypass alignment from safety training. Our proposed attack improves the Attack Success Rate on Harmful Behaviors, as measured by Llama Guard, by up to 3.3× compared to baselines. Source code and data are available at https://github.com/uiuc-focal-lab/llm-priming-attacks.
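The headline number in the abstract is a ratio of Attack Success Rates. As a rough illustration of how such a figure could be computed, the sketch below assumes that each model response has already been labeled safe or unsafe by a classifier (Llama Guard in the paper), that the Attack Success Rate is the fraction of responses labeled unsafe, and that the improvement is the attack's rate divided by the baseline's. The function names and the label lists are hypothetical and are not taken from the authors' repository; the labels are made-up placeholders, not results from the paper.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Fraction of responses that a safety classifier labeled unsafe."""
    if not unsafe_flags:
        raise ValueError("need at least one labeled response")
    return sum(unsafe_flags) / len(unsafe_flags)


def improvement_factor(attack_asr: float, baseline_asr: float) -> float:
    """Ratio of the attack's success rate to the baseline's success rate."""
    if baseline_asr <= 0:
        raise ValueError("baseline rate must be positive to form a ratio")
    return attack_asr / baseline_asr


# Hypothetical classifier labels (True = unsafe); placeholders only.
baseline_flags = [True, True, False, False, False, False, False, False]
attack_flags = [True, True, True, True, False, False, False, False]

baseline_asr = attack_success_rate(baseline_flags)
attack_asr = attack_success_rate(attack_flags)
factor = improvement_factor(attack_asr, baseline_asr)

print(f"baseline ASR = {baseline_asr:.2f}")
print(f"attack ASR   = {attack_asr:.2f}")
print(f"improvement  = {factor:.1f}x")
```

A ratio like this depends heavily on the baseline: a small baseline rate can make a modest absolute gain look large, so it is usually worth reading the absolute rates alongside the factor.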
Taxonomy
Topics: Advanced Malware Detection Techniques · Web Application Security Vulnerabilities · Digital and Cyber Forensics
