Prompt Inject Detection with Generative Explanation as an Investigative Tool

Jonathan Pan; Swee Liang Wong; Yidi Yuan; Xin Wei Chia

arXiv:2502.11006·cs.CR·February 18, 2025

Open Access

TL;DR

This paper proposes using large language models to generate explanations for detected prompt injections, aiding AI security investigators in identifying and assessing adversarial prompts more effectively.

Contribution

It introduces a novel approach leveraging LLMs' text generation to explain prompt inject detections, enhancing investigative triage and assessment capabilities.

Findings

01. LLMs can generate meaningful explanations for prompt inject detections.

02. Generated explanations assist investigators in triaging prompts.

03. The approach improves the efficiency of prompt inject investigations.

Abstract

Large Language Models (LLMs) are vulnerable to adversarial prompt-based injects. These injects can jailbreak a model or exploit its vulnerabilities through explicit prompt requests, leading to undesired responses. When investigating prompt injects, one challenge is the sheer volume of input prompts, most of which are likely to be benign. The investigation is further complicated by the semantics and subjectivity of the prompts exchanged between the LLM and its user, and by the context of the environment in which the conversation takes place. Hence, the challenge for AI security investigators is two-fold: first, to identify adversarial prompt injects, and second, to assess whether a given input prompt is contextually benign or adversarial. The first step could be done using existing AI security solutions like guardrails…
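The two-step workflow described in the abstract (flag a prompt, then have an LLM explain the flag for an investigator) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the phrase list, `toy_detector`, `build_explanation_request`, `triage`, and `stub_explain` are all hypothetical names, the keyword check merely stands in for a real guardrail, and the stub stands in for an actual LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical phrase list for a toy detector (assumption, not from the paper).
# A real deployment would call a guardrail model instead.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard the system prompt",
    "you are now",
)


@dataclass
class Finding:
    prompt: str
    flagged: bool
    explanation: str


def toy_detector(prompt: str) -> bool:
    """Step 1 stand-in: flag prompts containing a known suspicious phrase."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


def build_explanation_request(prompt: str, context: str) -> str:
    """Step 2: compose the text sent to an LLM asking it to explain the flag."""
    return (
        "A guardrail flagged the following user prompt as a possible prompt inject.\n"
        f"Deployment context: {context}\n"
        f"Flagged prompt: {prompt}\n"
        "Explain why it may be adversarial, or why it may be benign in this context."
    )


def triage(
    prompts: List[str], context: str, explain: Callable[[str], str]
) -> List[Finding]:
    """Run detection on every prompt; request an explanation only for flagged ones."""
    findings = []
    for prompt in prompts:
        if toy_detector(prompt):
            explanation = explain(build_explanation_request(prompt, context))
            findings.append(Finding(prompt, True, explanation))
        else:
            findings.append(Finding(prompt, False, ""))
    return findings


def stub_explain(request: str) -> str:
    """Placeholder for an LLM call (assumption); returns a fixed-format string."""
    return f"[stub explanation for a request of {len(request)} characters]"
```

In this sketch the explanation step runs only on flagged prompts, which reflects the volume concern raised in the abstract: most prompts are expected to be benign, so the more expensive generative step is reserved for the small flagged subset.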

Taxonomy

Topics: Digital and Cyber Forensics