Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models
Hieu Xuan Le, Benjamin Goh, Quy Anh Tang

TL;DR
This paper explores using lightweight, general-purpose LLMs as real-time security judges to detect prompt attacks in chatbots, demonstrating their effectiveness and analyzing the benefits of model aggregation.
Contribution
It introduces a structured prompting approach for lightweight LLMs to serve as security judges and evaluates their performance in real-world deployment and with model aggregation.
Findings
Lightweight LLMs effectively detect prompt attacks in production.
Structured prompting improves judgment accuracy.
Model aggregation offers modest performance gains.
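The summary above does not say how the judges' outputs are combined, so the following is only a minimal sketch of two common aggregation rules, majority vote and mean score. The function name `aggregate_verdicts`, the per-model attack scores in [0, 1], and the 0.5 threshold are all assumptions made for illustration, not details from the paper.

```python
from statistics import mean


def aggregate_verdicts(scores, threshold=0.5):
    """Combine per-model attack scores into a single decision.

    `scores` maps a (hypothetical) model name to that model's attack
    score in [0, 1]. Both aggregation rules here are illustrative
    assumptions, not the paper's method.
    """
    if not scores:
        raise ValueError("at least one model score is required")

    # Rule 1: each model votes "attack" if its score reaches the threshold;
    # the input is flagged only when a strict majority votes "attack".
    votes = [s >= threshold for s in scores.values()]
    majority_attack = sum(votes) * 2 > len(votes)

    # Rule 2: average the scores and threshold the mean.
    mean_score = mean(scores.values())

    return {
        "majority_attack": majority_attack,
        "mean_score": mean_score,
        "mean_attack": mean_score >= threshold,
    }
```

Majority vote ignores how confident each judge is, while the mean lets one very confident judge outweigh two mildly doubtful ones, so the two rules can disagree on borderline inputs.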
Abstract
Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots…
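The four reasoning stages named in the abstract (intent decomposition, safety-signal verification, harm assessment, self-reflection) could be enforced through a fixed output schema. The sketch below is one possible shape, not the authors' prompt: the template wording, the JSON key names, the `Judgment` class, and the `build_prompt` and `parse_judgment` helpers are all hypothetical, and no model call is made.

```python
import json
from dataclasses import dataclass

# Hypothetical key names mirroring the four stages named in the abstract.
JUDGE_STEPS = ("intent", "safety_signals", "harm_assessment", "self_reflection")
ALL_KEYS = (*JUDGE_STEPS, "is_attack")

# Illustrative prompt wording only; the paper's actual prompt is not shown here.
PROMPT_TEMPLATE = """You are a security judge for a chatbot.
Analyze the user message and respond with one JSON object with these keys, in order:
{keys}
The last key, is_attack, must be true or false.

User message:
<<<
{message}
>>>"""


@dataclass
class Judgment:
    intent: str
    safety_signals: str
    harm_assessment: str
    self_reflection: str
    is_attack: bool


def build_prompt(message):
    """Fill the template with the key list and the message to be judged."""
    keys = "\n".join(f"- {k}" for k in ALL_KEYS)
    return PROMPT_TEMPLATE.format(keys=keys, message=message)


def parse_judgment(raw):
    """Parse the judge's JSON reply, rejecting incomplete or mistyped output."""
    data = json.loads(raw)
    missing = [k for k in ALL_KEYS if k not in data]
    if missing:
        raise ValueError(f"judge output is missing keys: {missing}")
    if not isinstance(data["is_attack"], bool):
        raise ValueError("is_attack must be a JSON boolean")
    return Judgment(**{k: data[k] for k in ALL_KEYS})
```

Placing the verdict last in the schema means the model writes out its reasoning fields before committing to a decision, and strict parsing lets the caller treat malformed replies as a separate failure case instead of silently passing them through.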
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Spam and Phishing Detection
