Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space

Leo Schwinn; David Dobre; Sophie Xhonneux; Gauthier Gidel; Stephan Günnemann

arXiv:2402.09063·cs.LG·April 17, 2025·5 cites

TL;DR

This paper introduces embedding space attacks on open-source LLMs, revealing their effectiveness in bypassing safety measures and extracting deleted information, thus highlighting new security threats.

Contribution

It proposes a novel embedding space attack method and demonstrates its ability to bypass safety alignment and extract unlearned data in open-source LLMs.

Findings

01

Embedding space attacks circumvent safety alignment and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning.

02

Embedding space attacks can extract supposedly deleted information from unlearned models.

03

Embedding attacks pose significant security threats to open-source LLMs.

Abstract

Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. However, this approach neglects the steady progression of open-source models. As open-source models advance in capability, ensuring their safety also becomes increasingly imperative. Yet, attacks tailored to open-source LLMs that exploit full model access remain largely unexplored. We address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Furthermore, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract…
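
To make the idea of attacking "the continuous embedding representation of input tokens" concrete, here is a minimal sketch on a toy model, not the authors' implementation (the official code is in the repository listed under Code & Models). Everything in it is an assumption made for illustration: the three-token vocabulary, the mean-pooling "model" with readout matrix W, the signed-gradient update, and the names forward and embedding_attack. The attacker leaves the prompt embeddings fixed and optimizes only an appended free embedding vector so that the model assigns high probability to a chosen target token.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sign(x):
    return 0.0 if x == 0 else math.copysign(1.0, x)

def forward(W, embeddings):
    """Toy stand-in for an LLM: mean-pool the input embeddings,
    then apply a linear readout W to get next-token probabilities."""
    n, d = len(embeddings), len(embeddings[0])
    h = [sum(e[j] for e in embeddings) / n for j in range(d)]
    logits = [sum(row[j] * h[j] for j in range(d)) for row in W]
    return softmax(logits)

def embedding_attack(W, prompt_emb, adv_emb, target, steps=100, lr=0.1):
    """Optimize the continuous adversarial embeddings (not discrete tokens)
    to minimize cross-entropy of the target token. The signed-gradient
    step is an assumption of this sketch."""
    adv = [row[:] for row in adv_emb]          # prompt embeddings stay fixed
    n = len(prompt_emb) + len(adv)
    d = len(adv[0])
    for _ in range(steps):
        probs = forward(W, prompt_emb + adv)
        # d(CE)/d(logits) = probs - onehot(target)
        dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
        # d(CE)/d(h) = W^T dlogits; mean pooling passes 1/n of it to each input
        grad = [sum(W[k][j] * dlogits[k] for k in range(len(W))) / n
                for j in range(d)]
        for row in adv:
            for j in range(d):
                row[j] -= lr * sign(grad[j])
    return adv

# Hypothetical setup: token 1 plays the role of a refusal, token 0 the target.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
prompt_emb = [[0.0, 1.0], [0.0, 1.0]]          # prompt pushes toward token 1
adv_init = [[0.0, 0.0]]                        # one free "soft token"
target = 0

before = forward(W, prompt_emb + adv_init)[target]
adv = embedding_attack(W, prompt_emb, adv_init, target)
after = forward(W, prompt_emb + adv)[target]
print(f"target probability before: {before:.3f}, after: {after:.3f}")
```

The optimized vector need not correspond to any token in the vocabulary, which is why this kind of attack presumes direct access to the model's embedding layer, as is the case for open-source models.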

Code & Models

Repositories

schwinnl/llm_embedding_attack
PyTorch · Official

Taxonomy

Topics: Cloud Data Security Solutions · Security and Verification in Computing · Digital and Cyber Forensics