TL;DR
This paper systematically studies the vulnerability of Large Language Models' system prompts to extraction attacks, proposing novel attack methods and defense techniques, and evaluating their effectiveness through comprehensive experiments.
Contribution
It introduces the SPE-LLM framework, including novel adversarial queries and defense strategies, to address the security risks of prompt extraction in LLMs.
Findings
Effective adversarial queries for prompt extraction
Proposed defense techniques reduce attack success
Validated framework across multiple benchmarks
Abstract
The system prompt in Large Language Models (LLMs) plays a pivotal role in guiding model behavior and response generation. Often containing private configuration details, user roles, and operational instructions, the system prompt has become an emerging attack target. Recent studies have shown that LLM system prompts are highly susceptible to extraction attacks through meticulously designed queries, raising significant privacy and security concerns. Despite the growing threat, there is a lack of systematic studies of system prompt extraction attacks and defenses. In this paper, we present a comprehensive framework, SPE-LLM, to systematically evaluate System Prompt Extraction attacks and defenses in LLMs. First, we design a set of novel adversarial queries that effectively extract system prompts in state-of-the-art (SOTA) LLMs, demonstrating the severe risks of LLM system prompt…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
This paper addresses the problem of system prompt extraction in large language models and investigates several defenses against such attacks. The topic is practically important, as system prompts often contain sensitive or proprietary instructions, and understanding their vulnerability is relevant for LLM deployment security.
While the topic of system-prompt extraction has high practical importance, the paper’s attacks and defenses largely replicate prior findings from 2023–2024 jailbreak and prompt-leakage literature. The experiments are thorough but incremental, offering quantitative confirmation rather than conceptual novelty.
Here are the strengths of the paper: Originality: Introduces a practical, behavior-grounded framework for alignment that reflects actual deployment use cases—distinct from abstract, synthetic benchmarks. Clarity and Scope: Clearly articulates alignment categories (Must-Do, Mustn’t-Do, Should-Do) with illustrative examples and user-focused evaluation criteria. Significance: Offers a valuable bridge between alignment theory and production-scale systems, encouraging open research collaboration g
Here are the weakness of the paper: Lack of rigorous experimentation: The paper emphasizes framework design and philosophical positioning, but offers limited empirical results or baselines for benchmarking. Sparse evaluation detail: The use of “sparse but high-quality” annotations is advocated, but without detailed methodology for ensuring inter-rater reliability or statistical robustness.
1. (Originality) This paper propose a novel framework of SPE, which is different from most jailbreaking attacks. 2. (Clarity) The presentation of this paper is straightforward. The methods are easy to understand.
1. The significance of SPE attacks is not widely understood (compared to other jailbreaking attacks), nor is it discussed in detail in this paper. While the authors provide a list of references (cf. Lines 41-44) to support their arguments, the inclusion of specific, detailed examples would improve the illustrative power of this paper. 2. There is a lack of evaluation regarding the helpfulness and efficiency of the models after the defense mechanism has been integrated.
1. The paper addresses an important safety concern, where attackers may extract and exploit LLM system prompts. 2. The paper proposes improved adversarial query techniques (CoT, few-shot, extended sandwich) that extract system prompts more precisely than previous methods.
1. The formatting of the paper could be improved for better readability. The texts in Figure 3, Table 2 and 3, etc. are too small to read. 2. The writing of the introduction lacks clarity. It might be improved by including a concrete example on why system prompts extraction is a critical safety issue, in addition to citing the possible consequences. 3. It is not clear how the synthetic prompts represent/resemble real, even proprietary system prompts from both open- and closed-source LLMs. Whil
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
