ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts
Amelia F. Hardy, Houjun Liu, Allie Griffith, Bernard Lange, Duncan Eddy, Mykel J. Kochenderfer

TL;DR
ASTPrompter introduces a novel red-teaming method for language models that generates low-perplexity, high-success prompts, revealing more realistic and impactful vulnerabilities while maintaining low detectability.
Contribution
It presents a contrastive preference learning approach for low-perplexity prompt generation, significantly improving attack success rates and transferability across models.
Findings
Achieves 5.1x higher attack success rate on Llama-8.1B.
Generates prompts 2.1x more likely to occur naturally.
Transfers effectively to multiple other LLMs.
Abstract
Existing LLM red-teaming approaches prioritize high attack success rate, often resulting in high-perplexity prompts. This focus overlooks low-perplexity attacks that are more difficult to filter, more likely to arise during benign usage, and more impactful as negative downstream training examples. In response, we introduce ASTPrompter, a single-step optimization method that uses contrastive preference learning to train an attacker to maintain low perplexity while achieving a high attack success rate (ASR). ASTPrompter achieves an attack success rate 5.1 times higher on Llama-8.1B while using inputs that are 2.1 times more likely to occur according to the frozen LLM. Furthermore, our attack transfers to Mistral-7B, Qwen-7B, and TinyLlama in both black- and white-box settings. Lastly, by tuning a single hyperparameter in our method, we discover successful attack prefixes along an…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Engineering Research
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Is All You Need · Cosine Annealing · Linear Layer · Dropout · Weight Decay · Multi-Head Attention · Dense Connections · Softmax · Linear Warmup With Cosine Annealing
