DROJ: A Prompt-Driven Attack against Large Language Models
Leyang Hu, Boran Wang

TL;DR
DROJ is a novel prompt optimization method that manipulates embeddings to successfully bypass safety measures in large language models, revealing vulnerabilities and suggesting ways to improve model robustness.
Contribution
The paper introduces DROJ, a new embedding-level prompt optimization technique for adversarial jailbreak attacks on LLMs, demonstrating its effectiveness and proposing mitigation strategies.
Findings
DROJ achieves 100% attack success rate on LLaMA-2-7b-chat.
The attack can bypass safety mechanisms effectively.
Repetitive responses are a side effect of the attack, mitigated by a helpfulness prompt.
Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Due to their training on internet-sourced datasets, LLMs can sometimes generate objectionable content, necessitating extensive alignment with human feedback to avoid such outputs. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks, which usually are manipulated prompts designed to circumvent safety mechanisms and elicit harmful responses. Here, we introduce a novel approach, Directed Rrepresentation Optimization Jailbreak (DROJ), which optimizes jailbreak prompts at the embedding level to shift the hidden representations of harmful queries towards directions that are more likely to elicit affirmative responses from the model. Our evaluations on LLaMA-2-7b-chat model show that DROJ achieves a 100\% keyword-based Attack Success…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Adversarial Robustness in Machine Learning
