Prompt Inject Detection with Generative Explanation as an Investigative Tool
Jonathan Pan, Swee Liang Wong, Yidi Yuan, Xin Wei Chia

TL;DR
This paper proposes using large language models to generate explanations for detected prompt injections, aiding AI security investigators in identifying and assessing adversarial prompts more effectively.
Contribution
It introduces a novel approach leveraging LLMs' text generation to explain prompt inject detections, enhancing investigative triage and assessment capabilities.
Findings
LLMs can generate meaningful explanations for prompt inject detections.
Generated explanations assist investigators in triaging prompts.
The approach improves the efficiency of prompt inject investigations.
Abstract
Large Language Models (LLMs) are vulnerable to adversarial prompt based injects. These injects could jailbreak or exploit vulnerabilities within these models with explicit prompt requests leading to undesired responses. In the context of investigating prompt injects, the challenge is the sheer volume of input prompts involved that are likely to be largely benign. This investigative challenge is further complicated by the semantics and subjectivity of the input prompts involved in the LLM conversation with its user and the context of the environment to which the conversation is being carried out. Hence, the challenge for AI security investigators would be two-fold. The first is to identify adversarial prompt injects and then to assess whether the input prompt is contextually benign or adversarial. For the first step, this could be done using existing AI security solutions like guardrails…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDigital and Cyber Forensics
