Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

Mohammed Alshaalan; Miguel R. D. Rodrigues

arXiv:2605.19966·cs.LG·May 20, 2026

TL;DR

This paper introduces CPD Online (CPD), an online change-point detector that flags optimization-based adversarial suffixes in LLM prompts by tracking shifts in the token-level next-token entropy stream. It improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models evaluated.

Contribution

The authors propose a training-free, model-agnostic online detector for adversarial prompts that localizes the onset of the adversarial suffix and improves F1 over windowed-perplexity baselines on six open-weight chat models.

Findings

01

CPD achieves higher F1 than the strongest windowed-perplexity baseline on all six open-weight chat models tested.

02

It localizes adversarial suffixes with 79.6% accuracy.

03

CPD reduces guard calls by 17-22% while maintaining detection quality.

Abstract

Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts, CPD improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B). On LLaMA-2-7B at the…
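
The pipeline the abstract describes (a robust baseline from the system prompt, standardized user-token entropies, a one-sided CUSUM) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the median/MAD baseline, the drift `k`, the threshold `h`, and all function names are assumptions chosen for the example, and the entropy values in the usage lines are made up.

```python
import numpy as np

def token_entropies(probs):
    """Shannon entropy (nats) of each next-token distribution; probs has shape (T, V)."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def robust_baseline(system_entropies):
    """Assumed robust baseline: median and MAD-based scale of system-prompt entropies."""
    x = np.asarray(system_entropies, dtype=float)
    center = np.median(x)
    mad = np.median(np.abs(x - center))
    # 1.4826 makes the MAD comparable to a standard deviation under normality.
    scale = max(1.4826 * mad, 1e-6)
    return center, scale

def cusum_detect(user_entropies, center, scale, k=0.5, h=5.0):
    """One-sided (upward) CUSUM over standardized user-token entropies.

    Returns (flagged, alarm_index, onset_index). The onset estimate is the
    token after the last position where the statistic was zero.
    k and h are hypothetical values, not taken from the paper.
    """
    s = 0.0
    last_zero = -1
    for t, e in enumerate(user_entropies):
        z = (e - center) / scale
        s = max(0.0, s + z - k)
        if s == 0.0:
            last_zero = t
        if s > h:
            return True, t, last_zero + 1
    return False, None, None

# Illustrative entropy values only.
center, scale = robust_baseline([1.0, 1.2, 0.9, 1.1, 1.0])
print(cusum_detect([1.0, 1.1, 0.9, 1.0, 3.0, 3.2, 2.9], center, scale))  # (True, 4, 4)
```

Because the statistic is updated one token at a time, the loop can stop at the first alarm, which is what makes the detector online; the reset-to-zero bookkeeping is what yields a suffix-onset estimate rather than only a yes/no flag.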
