Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Hieu Xuan Le; Benjamin Goh; Quy Anh Tang

arXiv:2603.25176 · cs.CL · March 27, 2026

Open Access

TL;DR

This paper explores using lightweight, general-purpose LLMs as real-time security judges to detect prompt attacks in chatbots, demonstrating their effectiveness and analyzing the benefits of model aggregation.

Contribution

It introduces a structured prompting approach for lightweight LLMs to serve as security judges and evaluates their performance in real-world deployment and with model aggregation.

Findings

01. Lightweight LLMs effectively detect prompt attacks in production.

02. Structured prompting improves judgment accuracy.

03. Model aggregation offers modest performance gains.
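To make the third finding concrete, here is a minimal sketch of one way several judges' outputs could be aggregated. The summary does not say how the paper combines models, so everything here is an assumption for illustration: the weighted-mean rule, the function name `aggregate_scores`, the per-model scores in [0, 1], and the 0.5 threshold are all hypothetical.

```python
from typing import Mapping, Optional, Tuple


def aggregate_scores(
    scores: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Combine per-model attack scores into one weighted-mean score.

    Assumption: each judge emits a score in [0, 1], where higher means
    "more likely a prompt attack". This is an illustrative rule, not
    the aggregation method used in the paper.
    """
    if not scores:
        raise ValueError("need at least one judge score")
    if weights is None:
        # Default to equal weighting across judges.
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = sum(scores[name] * weights[name] for name in scores) / total
    # Flag the prompt as an attack when the combined score reaches the threshold.
    return combined, combined >= threshold
```

For example, `aggregate_scores({"judge_a": 0.9, "judge_b": 0.3})` averages the two scores and flags the prompt because the mean is above the assumed threshold. A majority vote over binary verdicts would be an equally plausible alternative.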

Abstract

Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots…
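The structured reasoning process named in the abstract (intent decomposition, safety-signal verification, harm assessment, self-reflection) could be enforced through a fixed output schema that the judge must fill in. The sketch below is only an illustration under assumptions: the JSON format, the field names, the instruction text, and the `parse_verdict` helper are hypothetical and are not taken from the paper.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical instruction text; the paper's actual prompt is not shown here.
JUDGE_INSTRUCTIONS = (
    "Analyze the user prompt and reply with a JSON object containing: "
    "'intents' (list of the distinct intents in the prompt), "
    "'safety_signals' (list of attack indicators you verified), "
    "'harm_assessment' (short text), "
    "'reflection' (short self-check of your reasoning), "
    "'is_attack' (true or false)."
)

REQUIRED_FIELDS = (
    "intents", "safety_signals", "harm_assessment", "reflection", "is_attack"
)


@dataclass
class JudgeVerdict:
    intents: List[str]          # step 1: intent decomposition
    safety_signals: List[str]   # step 2: safety-signal verification
    harm_assessment: str        # step 3: harm assessment
    reflection: str             # step 4: self-reflection
    is_attack: bool             # final decision


def parse_verdict(raw: str) -> JudgeVerdict:
    """Validate a judge's raw JSON reply and convert it to a JudgeVerdict."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"judge output is missing fields: {missing}")
    if not isinstance(data["is_attack"], bool):
        raise ValueError("'is_attack' must be a JSON boolean")
    return JudgeVerdict(
        intents=[str(item) for item in data["intents"]],
        safety_signals=[str(item) for item in data["safety_signals"]],
        harm_assessment=str(data["harm_assessment"]),
        reflection=str(data["reflection"]),
        is_attack=data["is_attack"],
    )
```

Rejecting incomplete replies keeps the judge from skipping a reasoning step silently; how a caller treats a rejected reply (block, retry, or allow) is a deployment choice this sketch leaves open.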

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Spam and Phishing Detection