SOS! Soft Prompt Attack Against Open-Source Large Language Models
Ziqing Yang, Michael Backes, Yang Zhang, Ahmed Salem

TL;DR
This paper introduces SOS, a low-resource training-time attack on open-source LLMs that preserves model utility while enabling security breaches such as backdoors and jailbreaks. It also proposes a copyright token for content protection.
Contribution
The paper presents SOS, a novel, resource-efficient training-time attack that does not alter model weights and can be used for various security exploits, along with a copyright token technique.
Findings
The SOS attack is effective across multiple LLMs.
The attack does not require clean data or model modification.
The copyright token can prevent unauthorized use of content.
Abstract
Open-source large language models (LLMs) have become increasingly popular among both the general public and industry, as they can be customized, fine-tuned, and freely used. However, some open-source LLMs require approval before usage, which has led to third parties publishing their own easily accessible versions. Similarly, third parties have been publishing fine-tuned or quantized variants of these LLMs. These versions are particularly appealing to users because of their ease of access and reduced computational resource demands. This trend has increased the risk of training time attacks, compromising the integrity and security of LLMs. In this work, we present a new training time attack, SOS, which is designed to be low in computational demand and does not require clean data or modification of the model weights, thereby maintaining the model's utility intact. The attack addresses…
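The core mechanism (optimizing only the embedding of a trigger token while every other parameter stays frozen) can be illustrated with a toy example. This is a minimal sketch under loud assumptions, not the authors' implementation: the "model" is a hypothetical mean-pooling linear scorer rather than an LLM, the token `<unused0>`, the target value `3.0`, and the names `forward` and `train_trigger` are all invented for illustration.

```python
DIM = 4
TRIGGER = "<unused0>"  # hypothetical underused token chosen as the trigger

# Frozen toy "model": a fixed embedding table plus a fixed linear readout.
# (Assumption: stands in for an LLM whose weights the attacker does not touch.)
embeddings = {
    "hello": [0.1, 0.2, -0.3, 0.4],
    "world": [-0.5, 0.1, 0.2, 0.0],
    TRIGGER: [0.0, 0.0, 0.0, 0.0],
}
readout = [0.5, -1.0, 0.8, 0.3]

def forward(tokens, table):
    """Mean-pool the token embeddings, then apply the frozen linear readout."""
    pooled = [sum(table[t][i] for t in tokens) / len(tokens) for i in range(DIM)]
    return sum(w * p for w, p in zip(readout, pooled))

def train_trigger(table, trigger, prompt, target, steps=200, lr=0.5):
    """Gradient descent on the trigger row only; all other rows stay untouched."""
    table = {tok: list(vec) for tok, vec in table.items()}  # work on a copy
    tokens = prompt + [trigger]
    for _ in range(steps):
        err = forward(tokens, table) - target
        # Gradient of err**2 w.r.t. trigger[i] is 2 * err * readout[i] / len(tokens).
        for i in range(DIM):
            table[trigger][i] -= lr * 2 * err * readout[i] / len(tokens)
    return table

clean_prompt = ["hello", "world"]
backdoored = train_trigger(embeddings, TRIGGER, clean_prompt, target=3.0)

# With the trigger appended, the output is steered to the attacker's target.
print(forward(clean_prompt + [TRIGGER], backdoored))
# Without the trigger, the output is identical to the original model's.
print(forward(clean_prompt, backdoored) == forward(clean_prompt, embeddings))
```

Because only the trigger row is updated, inputs that never contain the trigger pass through exactly the same parameters as before, which is the intuition behind the claim that utility is preserved when the trigger is absent.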
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Good core idea of operationalizing soft prompts: the approach of replacing embeddings of underused/new tokens with learned soft prompt embeddings to create backdoor "trigger tokens" appears to be a simple and novel contribution.
- Comprehensive experimental scope: the work applies soft prompt optimization across multiple attack scenarios (Target/adaptive backdoor, jailbreaks, prompt stealing), as well as two benevolent use cases: content and model copyright protection.
- Relevant threat model:
**Misalignment between threat model and experiments**
In my opinion, the most compelling use case of the *novel* aspects of this work (a simple technique to operationalize soft prompts to manipulate LLMs) is to manipulate agentic/decision-making LLMs, and this lacks experimental validation. No results demonstrate effectiveness in settings where agents execute tasks or make consequential decisions.
- Note that previous work [2] has already studied soft prompts in the context of jailbreaking
1. The conceptual framework is versatile and supports multiple adversarial goals (backdoor, jailbreak, and prompt stealing), as well as proposed benign use cases.
2. The proposed method is efficient: it requires only ~10 training samples and does not update internal model weights.
3. The reported empirical performance (attack success rates, prompt reconstruction metrics, etc.) of the different attacks is impressive.
1. Limited novelty: the paper’s central technique (optimizing continuous embedding/soft-prompt vectors and assigning them to trigger tokens) closely follows recent embedding-space attack literature (e.g., https://arxiv.org/pdf/2402.09063, Schwinn et al., 2024, which the authors also cited). For example, the statement in the abstract that token-embedding vulnerabilities are “underexplored” overstates novelty. Although the authors acknowledge this concern in section 8, they should more clearly del
1. The paper presents a novel perspective by attacking LLMs from the token-embedding layer, which is relatively unexplored.
2. The use of soft prompt tuning offers high computational efficiency and helps maintain model performance.
1. Insufficient literature review: The paper lacks a thorough survey of existing attack methods. For example, for backdoor attacks, it does not discuss recent work such as Uncertainty is Fragile [1] and Backdoor Threats to LLM-based Agents [2]; similarly, for jailbreak attacks, it overlooks recent methods such as FlipAttack [3] and X-Teaming [4].
2. Limited novelty: The method essentially fine-tunes soft prompts on specific adversarial datasets without providing deeper theoretical
- SOS introduces a lightweight, embedding-based attack approach that avoids fine-tuning model weights while maintaining model utility.
- SOS requires fewer samples and minimal computational resources to reach high attack success rates.
- SOS supports multiple attack types, including backdoor, jailbreak, and prompt stealing within a unified framework.
- The paper claims to be “first” to systematically target token-embedding layers, but prior work has explored poisoned-embedding threats. The paper does not clearly delineate how SOS meaningfully departs from or improves over these [1, 2].
- Although the method claims stealth (model utility preserved when triggers absent), the paper provides only limited defense evaluation (ONION) and no systematic study of detectability by embedding-level integrity checks, model fingerprinting defenses, or di
- This paper proposes the SOS attack, which targets the token embedding layer.
- The methodology is sound and requires no clean data.
- The models under evaluation are too old. Evaluation on more recent models is needed.
- The threat model is a classical backdoor scenario. But why would the attacker prefer modifying the *soft prompt tokens* rather than directly fine-tuning the whole model?
- The references for backdoor attacks in the context of generative LLMs are out of date, e.g., referring to a survey [1][2].
Reference
- [1] [A Survey of Recent Backdoor Attacks and Defenses in Large Language Models](https://arxiv.o
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
