Knowledge Return Oriented Prompting (KROP)

Jason Martin; Kenneth Yeung

arXiv:2406.11880 · cs.CR · June 19, 2024

TL;DR

This paper introduces KROP, a prompt injection technique that obfuscates malicious prompts so that most existing prompt filters and alignment defenses in large language models fail to detect them.

Contribution

KROP is an attack technique that obfuscates prompt injections well enough to slip past the prompt filters and alignment measures that LLMs and LLM-powered apps rely on, showing that these defenses are not foolproof.

Findings

01. KROP successfully evades most prompt detection mechanisms.

02. Obfuscation with KROP maintains the original prompt's functionality.

03. KROP demonstrates robustness across various LLM architectures.

Abstract

Many Large Language Models (LLMs) and LLM-powered apps deployed today use some form of prompt filter or alignment to protect their integrity. However, these measures aren't foolproof. This paper introduces KROP, a prompt injection technique capable of obfuscating prompt injection attacks, rendering them virtually undetectable to most of these security measures.
