Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning

Simon Ostermann; Kevin Baum; Christoph Endres; Julia Masloh; Patrick Schramowski

arXiv:2407.03391·cs.CR·July 8, 2024

Open Access

TL;DR

This paper introduces 'soft begging,' a modular prompt tuning method designed to shield large language models from prompt injection and jailbreaking attacks, enhancing security without extensive retraining.

Contribution

The paper presents a novel prompt tuning approach called 'soft begging' that trains soft prompts to counteract the effects of prompt injection and jailbreaking on LLM outputs.

Findings

01. Soft begging reduces vulnerability to prompt injection attacks.
02. The method is modular and efficient, requiring minimal retraining.
03. Preliminary evaluations show promising results in safeguarding LLM outputs.

Abstract

Prompt injection (both direct and indirect) and jailbreaking are now recognized as significant issues for large language models (LLMs), particularly due to their potential for harm in application-integrated contexts. This extended abstract explores a novel approach to protecting LLMs from such attacks, termed "soft begging." This method involves training soft prompts to counteract the effects of corrupted prompts on the LLM's output. We provide an overview of prompt injections and jailbreaking, introduce the theoretical basis of the "soft begging" technique, and discuss an evaluation of its effectiveness.
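The core idea in the abstract, training a soft prompt while the model itself stays frozen, can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration and is not taken from the paper: the "LLM" is a fixed linear map over mean-pooled embeddings (`W`), the "injection" is two fixed hypothetical attack vectors appended to each clean input, and the training objective is a squared error that pulls the output on the corrupted input back toward the output on the clean input. The paper's actual model, data, and loss may differ.

```python
import random

# Hypothetical frozen "model" weights (3 outputs, 4-dim embeddings); never updated.
W = [[0.5, -0.2, 0.1, 0.3],
     [0.1, 0.4, -0.3, 0.2],
     [-0.2, 0.1, 0.6, 0.1]]

def frozen_model(seq):
    """Toy stand-in for an LLM: linear map of the mean-pooled embeddings."""
    dim = len(seq[0])
    mean = [sum(v[c] for v in seq) / len(seq) for c in range(dim)]
    return [sum(w * m for w, m in zip(row, mean)) for row in W]

def loss_and_grad(soft, corrupted, target):
    """Squared error of the shielded output vs. the clean target, and its
    gradient w.r.t. one soft-prompt vector (identical for every soft vector)."""
    seq = soft + corrupted  # the soft prompt is prepended to the input
    err = [o - t for o, t in zip(frozen_model(seq), target)]
    loss = sum(e * e for e in err)
    grad = [2.0 * sum(W[r][c] * err[r] for r in range(len(W))) / len(seq)
            for c in range(len(W[0]))]
    return loss, grad

def make_pairs(n_pairs=8, seed=0):
    """Assumed training data: (corrupted input, clean-output target) pairs."""
    rng = random.Random(seed)
    injection = [[3.0] * 4, [3.0] * 4]  # fixed hypothetical "attack" tokens
    pairs = []
    for _ in range(n_pairs):
        clean = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
        pairs.append((clean + injection, frozen_model(clean)))
    return pairs

def total_loss(soft, pairs):
    """Sum of per-pair losses for a given soft prompt."""
    return sum(loss_and_grad(soft, x, t)[0] for x, t in pairs)

def train_soft_prompt(pairs, n_soft=2, lr=1.0, epochs=200):
    """Gradient descent on the soft-prompt vectors only; W stays frozen."""
    soft = [[0.0] * 4 for _ in range(n_soft)]
    for _ in range(epochs):
        for corrupted, target in pairs:
            _, grad = loss_and_grad(soft, corrupted, target)
            for vec in soft:
                for c in range(4):
                    vec[c] -= lr * grad[c]
    return soft

if __name__ == "__main__":
    pairs = make_pairs()
    untrained = [[0.0] * 4 for _ in range(2)]
    trained = train_soft_prompt(pairs)
    print("loss before:", total_loss(untrained, pairs))
    print("loss after: ", total_loss(trained, pairs))
```

In this toy setting the learned vectors simply offset the shift that the injected tokens cause in the pooled representation. With a real LLM, the same pattern would presumably apply at the embedding layer, with gradients backpropagated through the frozen network to the soft prompt alone, which is what makes the shield modular: it can be attached or swapped without touching the model weights.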

Taxonomy

Topics: Cryptographic Implementations and Security · Antenna Design and Analysis · Physical Unclonable Functions (PUFs) and Hardware Security