Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes
Mohammed Alshaalan, Miguel R. D. Rodrigues

TL;DR
This paper introduces CPD, an online change-point detection method that identifies optimization-based adversarial prompts to large language models from the stream of token-level next-token entropies, outperforming windowed perplexity-based detectors.
Contribution
The authors propose a training-free, model-agnostic online detector for adversarial prompts that localizes the suffix onset and improves F1 over windowed-perplexity baselines on six open-weight chat models.
Findings
CPD achieves higher F1 scores than windowed perplexity baselines.
It localizes adversarial suffixes with 79.6% accuracy.
It reduces guard calls by 17-22% while maintaining detection quality.
Abstract
Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts, CPD improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B). On LLaMA-2-7B at the…
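The pipeline described in the abstract (baseline from the system prompt, standardized user-token entropies, one-sided CUSUM) can be illustrated with a minimal sketch. This is not the authors' implementation. The choice of median and MAD as the robust baseline, the upward direction of the shift, the onset rule (last reset of the statistic to zero), the function names, and the values of the drift `k` and threshold `h` are all assumptions made for illustration. The entropies are taken as precomputed per-token values.

```python
from statistics import median


def robust_baseline(entropies):
    """Return (center, scale) for a list of system-prompt token entropies.

    Assumption: median and MAD are used as the robust estimators; the
    source only says the baseline is "robust".
    """
    center = median(entropies)
    mad = median([abs(e - center) for e in entropies])
    scale = 1.4826 * mad  # MAD rescaled to match the std of a normal
    return center, max(scale, 1e-6)  # guard against a zero scale


def cusum_detect(system_entropies, user_entropies, k=0.5, h=5.0):
    """One-sided (upward) CUSUM over standardized user-token entropies.

    k (drift allowance) and h (alarm threshold) are hypothetical values.
    Returns (alarm, onset), where onset is the index of the first user
    token after the statistic was last at zero, or None without an alarm.
    """
    center, scale = robust_baseline(system_entropies)
    s = 0.0
    onset = 0
    for t, entropy in enumerate(user_entropies):
        z = (entropy - center) / scale   # standardize against the baseline
        s = max(0.0, s + z - k)          # accumulate only upward deviations
        if s == 0.0:
            onset = t + 1                # a change can start no earlier than here
        if s > h:
            return True, onset           # online alarm: stop at first crossing
    return False, None


# Toy values, not data from the paper.
system = [1.0, 1.2, 0.8, 1.1, 0.9]
benign = [1.0, 0.9, 1.1, 1.0]
attack = [1.0, 0.9, 1.0, 2.5, 2.6, 2.4]  # entropy jumps at index 3
print(cusum_detect(system, benign))
print(cusum_detect(system, attack))
```

Because the statistic is updated one token at a time and the loop returns at the first threshold crossing, the sketch runs online and needs no training, which matches the properties the abstract claims for CPD.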
