PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization
Huseein Jawad, Nicolas Brunel

TL;DR
This paper proposes a lightweight, black-box optimization framework that hardens LLM system prompts by appending a protective text layer (a SHIELD), significantly reducing prompt leakage under adversarial extraction attacks while maintaining task utility.
Contribution
It formalizes prompt hardening as a utility-constrained optimization problem and uses an LLM as the optimizer to generate SHIELDs that minimize leakage while preserving task utility.
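As a rough illustration of what a utility-constrained search over SHIELDs could look like, the sketch below keeps the candidate with the lowest leakage among those that meet a utility floor. It is a minimal sketch, not the authors' implementation: `optimize_shield`, `propose`, `leak_score`, `utility_score`, and `min_utility` are hypothetical names, and the paper's actual leakage metric, utility measure, and optimizer prompt are not given in this summary. It also assumes the unshielded prompt already meets the utility floor.

```python
from typing import Callable, List, Tuple

# One record per evaluated candidate: (shield text, leakage, utility).
History = List[Tuple[str, float, float]]


def optimize_shield(
    base_prompt: str,
    propose: Callable[[History], List[str]],   # hypothetical stand-in for the LLM optimizer
    leak_score: Callable[[str], float],        # hypothetical leakage metric, lower is better
    utility_score: Callable[[str], float],     # hypothetical task-utility metric, higher is better
    min_utility: float,
    rounds: int = 3,
) -> Tuple[str, float]:
    """Return the feasible shield with the lowest observed leakage."""
    history: History = []
    # Baseline: no shield appended (assumed to satisfy the utility floor).
    best_shield = ""
    best_leak = leak_score(base_prompt)

    for _ in range(rounds):
        # The optimizer sees past scores so it can refine its proposals.
        for shield in propose(history):
            hardened = base_prompt + "\n" + shield
            leak = leak_score(hardened)
            util = utility_score(hardened)
            history.append((shield, leak, util))
            # Accept only candidates that satisfy the utility constraint
            # and strictly improve on the best leakage seen so far.
            if util >= min_utility and leak < best_leak:
                best_shield, best_leak = shield, leak

    return best_shield, best_leak
```

In a real setting, `propose` would call an LLM with the scored history, and `leak_score` would run a set of extraction attacks against the hardened prompt through the target model's API. Both are left as caller-supplied functions here so the search loop stays black-box with respect to the model.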
Findings
Optimized SHIELDs significantly reduce prompt leakage against extraction attacks.
The method outperforms existing defenses in effectiveness.
It maintains high task utility with minimal overhead.
Abstract
System prompts are critical for guiding the behavior of Large Language Models (LLMs), yet they often contain proprietary logic or sensitive information, making them a prime target for extraction attacks. Adversarial queries can successfully elicit these hidden instructions, posing significant security and privacy risks. Existing defense mechanisms frequently rely on heuristics, incur substantial computational overhead, or are inapplicable to models accessed via black-box APIs. This paper introduces a novel framework for hardening system prompts through shield appending, a lightweight approach that adds a protective textual layer to the original prompt. Our core contribution is the formalization of prompt hardening as a utility-constrained optimization problem. We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Topic Modeling
