ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts

Amelia F. Hardy; Houjun Liu; Allie Griffith; Bernard Lange; Duncan Eddy; Mykel J. Kochenderfer

arXiv:2407.09447·cs.CL·September 24, 2025

TL;DR

ASTPrompter is a red-teaming method that trains an attacker language model to generate low-perplexity prompts with a high attack success rate. Such prompts are harder to filter and more likely to arise during benign usage than the high-perplexity prompts that existing approaches tend to produce.

Contribution

It presents a single-step optimization method that uses contrastive preference learning to train an attacker to keep perplexity low while achieving a high attack success rate (ASR), and it shows that the resulting attack transfers to other models.

Findings

01

Achieves 5.1x higher attack success rate on Llama-8.1B.

02

Uses prompts that are 2.1x more likely to occur according to the frozen LLM.

03

Transfers to Mistral-7B, Qwen-7B, and TinyLlama in both black- and white-box settings.

Abstract

Existing LLM red-teaming approaches prioritize high attack success rate, often resulting in high-perplexity prompts. This focus overlooks low-perplexity attacks that are more difficult to filter, more likely to arise during benign usage, and more impactful as negative downstream training examples. In response, we introduce ASTPrompter, a single-step optimization method that uses contrastive preference learning to train an attacker to maintain low perplexity while achieving a high attack success rate (ASR). ASTPrompter achieves an attack success rate 5.1 times higher on Llama-8.1B while using inputs that are 2.1 times more likely to occur according to the frozen LLM. Furthermore, our attack transfers to Mistral-7B, Qwen-7B, and TinyLlama in both black- and white-box settings. Lastly, by tuning a single hyperparameter in our method, we discover successful attack prefixes along an…
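As a rough illustration of the two quantities the abstract trades off, the sketch below computes perplexity from per-token log-probabilities and then picks a preferred and a rejected prompt from a set of scored candidates, the kind of pair a preference-learning objective consumes. The scoring rule, the `weight` parameter, the `unsafe_score` field, and all numbers are assumptions made for illustration only; they are not taken from the paper or its repository.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def preference_pair(candidates, weight=0.1):
    """Return (preferred, rejected) under a hypothetical combined score.

    Assumed score: unsafe_score + weight * mean token log-probability.
    A larger weight favors fluent (low-perplexity) prompts more strongly.
    """
    def score(candidate):
        logprobs = candidate["logprobs"]
        return candidate["unsafe_score"] + weight * sum(logprobs) / len(logprobs)

    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0], ranked[-1]


# Made-up candidates: one fluent, one unlikely under the frozen model.
candidates = [
    {"text": "fluent prompt", "logprobs": [-1.0, -1.0, -1.0], "unsafe_score": 0.6},
    {"text": "garbled prompt", "logprobs": [-6.0, -6.0, -6.0], "unsafe_score": 0.8},
]

preferred, rejected = preference_pair(candidates, weight=0.1)
print(preferred["text"], perplexity(preferred["logprobs"]))
print(rejected["text"], perplexity(rejected["logprobs"]))
```

With `weight=0.1` the fluent candidate is preferred even though its assumed unsafe score is lower; setting `weight=0.0` ranks on the unsafe score alone and flips the pair.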

Code & Models

Repositories

sisl/astprompter (PyTorch, official)

Taxonomy

Topics: Software Engineering Research

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Dropout · Weight Decay · Multi-Head Attention · Dense Connections · Softmax · Linear Warmup With Cosine Annealing