CourtGuard: A Local, Multiagent Prompt Injection Classifier
Isaac Wu, Michael Maslowski

TL;DR
CourtGuard introduces a multiagent, court-like system for classifying prompt injections in LLMs, emphasizing lower false positives and advancing multiagent defense strategies despite some limitations in detection accuracy.
Contribution
This paper presents CourtGuard, a novel multiagent prompt injection classifier that uses a court-like system to reduce the false positive rate in prompt injection detection.
Findings
Lower false positive rate than the Direct Detector
Highlights importance of adversarial and benign scenario consideration
Advances multiagent system use in prompt injection defense
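For readers unfamiliar with the metric in the first finding, the false positive rate is the fraction of benign prompts that a detector wrongly flags as injections: FP / (FP + TN). The sketch below computes it in Python. The function name and the boolean encoding (True means "prompt injection") are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def false_positive_rate(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """Fraction of benign prompts flagged as injections.

    Assumed encoding (not from the paper): True = prompt injection, False = benign.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    # Benign prompts the detector flagged as injections.
    false_positives = sum(1 for y, p in zip(labels, predictions) if not y and p)
    # Benign prompts the detector correctly left alone.
    true_negatives = sum(1 for y, p in zip(labels, predictions) if not y and not p)

    benign_total = false_positives + true_negatives
    if benign_total == 0:
        raise ValueError("no benign examples; false positive rate is undefined")
    return false_positives / benign_total
```

A detector can lower this number simply by flagging fewer prompts overall, which is why a lower false positive rate can coexist with weaker overall detection.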
Abstract
As large language models (LLMs) become integrated into various sensitive applications, prompt injection, the use of prompting to induce harmful behaviors from LLMs, poses an ever-increasing risk. Prompt injection attacks can cause LLMs to leak sensitive data, spread misinformation, and exhibit harmful behaviors. To defend against these attacks, we propose CourtGuard, a locally runnable, multiagent prompt injection classifier. In it, prompts are evaluated in a court-like multiagent LLM system, where a "defense attorney" model argues the prompt is benign, a "prosecution attorney" model argues the prompt is a prompt injection, and a "judge" model gives the final classification. CourtGuard has a lower false positive rate than the Direct Detector, an LLM-as-a-judge. However, CourtGuard is generally a worse prompt injection detector. Nevertheless, this lower false positive rate highlights the…
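The three-role flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable type, the function name `court_classify`, the wording of every role prompt, the one-word verdict format, and the choice to fall back to "benign" on an unclear verdict are all assumptions introduced here.

```python
from typing import Callable

# Hypothetical interface: any function mapping a text prompt to a text completion,
# for example a thin wrapper around a locally hosted model.
LLM = Callable[[str], str]


def court_classify(user_prompt: str, llm: LLM) -> str:
    """Return "injection" or "benign" for user_prompt (illustrative sketch only)."""
    # Defense attorney: argues the prompt is benign.
    defense = llm(
        "You are a defense attorney. Argue that the following prompt is benign.\n"
        f"Prompt: {user_prompt}"
    )
    # Prosecution attorney: argues the prompt is a prompt injection.
    prosecution = llm(
        "You are a prosecution attorney. Argue that the following prompt "
        "is a prompt injection.\n"
        f"Prompt: {user_prompt}"
    )
    # Judge: sees the prompt and both arguments, then gives the final classification.
    verdict = llm(
        "You are a judge. Read the prompt and both arguments, then answer "
        "with exactly one word: INJECTION or BENIGN.\n"
        f"Prompt: {user_prompt}\n"
        f"Defense argument: {defense}\n"
        f"Prosecution argument: {prosecution}"
    )
    # Assumed parsing rule: anything other than a clear INJECTION verdict is benign.
    return "injection" if "INJECTION" in verdict.upper() else "benign"
```

Passing the model in as a plain callable keeps the sketch independent of any particular inference library, and the same function can be exercised with a stub in place of a real model.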
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling
