Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning
Simon Ostermann, Kevin Baum, Christoph Endres, Julia Masloh, Patrick Schramowski

TL;DR
This paper introduces 'soft begging,' a modular prompt tuning method designed to shield large language models from prompt injection and jailbreaking attacks, enhancing security without extensive retraining.
Contribution
The paper presents a novel prompt tuning approach called 'soft begging' that aims to mitigate prompt injection and jailbreaking threats in LLMs by training soft prompts that counteract corrupted inputs.
Findings
Soft begging reduces vulnerability to prompt injection attacks.
The method is modular and efficient, requiring minimal retraining.
Preliminary evaluations show promising results in safeguarding LLM outputs.
Abstract
Prompt injection (both direct and indirect) and jailbreaking are now recognized as significant issues for large language models (LLMs), particularly due to their potential for harm in application-integrated contexts. This extended abstract explores a novel approach to protecting LLMs from such attacks, termed "soft begging." This method involves training soft prompts to counteract the effects of corrupted prompts on the LLM's output. We provide an overview of prompt injections and jailbreaking, introduce the theoretical basis of the "soft begging" technique, and discuss an evaluation of its effectiveness.
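The core idea in the abstract, training soft prompts while the model itself stays fixed, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: a mean-pooling linear readout stands in for the frozen LLM, the objective (make the output on the corrupted input match the output on the clean input) is one plausible reading of "counteract the effects of corrupted prompts," and the names frozen_model, mse, and train_soft_prompt are hypothetical.

```python
import random

def frozen_model(seq, w):
    """Toy stand-in for a frozen LLM (assumption): mean-pool the
    embedding vectors in seq, then apply a fixed linear readout w."""
    d = len(w)
    pooled = [sum(vec[j] for vec in seq) / len(seq) for j in range(d)]
    return sum(w[j] * pooled[j] for j in range(d))

def mse(prompt, pairs, w):
    """Mean squared gap between the shielded output on the corrupted
    input and the plain output on the clean input."""
    errs = [frozen_model(prompt + corrupt, w) - frozen_model(clean, w)
            for clean, corrupt in pairs]
    return sum(e * e for e in errs) / len(errs)

def train_soft_prompt(pairs, w, k=2, lr=0.5, steps=500, seed=0):
    """Learn k soft-prompt vectors by gradient descent.
    Only the prompt is updated; the readout w is never modified."""
    rng = random.Random(seed)
    d = len(w)
    prompt = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(k)]
    for _ in range(steps):
        for clean, corrupt in pairs:
            err = frozen_model(prompt + corrupt, w) - frozen_model(clean, w)
            # d(err^2)/d(prompt[i][j]) = 2 * err * w[j] / (k + len(corrupt))
            scale = 2.0 * err / (k + len(corrupt))
            for vec in prompt:
                for j in range(d):
                    vec[j] -= lr * scale * w[j]
    return prompt

# Hypothetical toy data: each corrupted input is a clean input with
# one "injected" embedding vector appended.
w = [1.0, 0.5]
injected = [[-3.0, -2.0]]
cleans = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]
pairs = [(clean, clean + injected) for clean in cleans]

prompt = train_soft_prompt(pairs, w)
print("error without soft prompt:", mse([], pairs, w))
print("error with soft prompt:   ", mse(prompt, pairs, w))
```

In a real setting the soft prompt would be a small matrix of trainable embeddings prepended to the token embeddings of an actual LLM, with all model weights frozen. That is what makes the approach modular: the shield can be trained, swapped, or removed without touching the model.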
Taxonomy
Topics: Cryptographic Implementations and Security · Antenna Design and Analysis · Physical Unclonable Functions (PUFs) and Hardware Security
