DROJ: A Prompt-Driven Attack against Large Language Models

Leyang Hu; Boran Wang

arXiv:2411.09125·cs.CL·November 15, 2024

DROJ: A Prompt-Driven Attack against Large Language Models

Leyang Hu, Boran Wang

PDF

Open Access 1 Repo

TL;DR

DROJ is a novel prompt optimization method that manipulates embeddings to successfully bypass safety measures in large language models, revealing vulnerabilities and suggesting ways to improve model robustness.

Contribution

The paper introduces DROJ, a new embedding-level prompt optimization technique for adversarial jailbreak attacks on LLMs, demonstrating its effectiveness and proposing mitigation strategies.

Findings

01

DROJ achieves 100% attack success rate on LLaMA-2-7b-chat.

02

The attack can bypass safety mechanisms effectively.

03

Repetitive responses are a side effect of the attack, mitigated by a helpfulness prompt.

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Due to their training on internet-sourced datasets, LLMs can sometimes generate objectionable content, necessitating extensive alignment with human feedback to avoid such outputs. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks, which usually are manipulated prompts designed to circumvent safety mechanisms and elicit harmful responses. Here, we introduce a novel approach, Directed Rrepresentation Optimization Jailbreak (DROJ), which optimizes jailbreak prompts at the embedding level to shift the hidden representations of harmful queries towards directions that are more likely to elicit affirmative responses from the model. Our evaluations on LLaMA-2-7b-chat model show that DROJ achieves a 100\% keyword-based Attack Success…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

leon-leyang/llm-safeguard
pytorchOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Adversarial Robustness in Machine Learning