FLRT: Fluent Student-Teacher Redteaming
T. Ben Thompson, Michael Sklar (Confirm Labs)

TL;DR
This paper introduces FLRT, a novel method for creating fluent, human-like adversarial prompts that effectively jailbreak safety-tuned language models, outperforming previous techniques in success rate and fluency.
Contribution
The authors develop a distillation-based approach, combined with improved optimization and fluency penalties, that generates fluent, human-like prompts for model redteaming and improves on prior discrete-optimization attacks (primarily GCG and BEAST).
Findings
Achieves over 93% success rate on Llama-2-7B and Vicuna-7B.
Maintains low perplexity (<33) in generated prompts.
Transfers effectiveness to unseen tasks and black-box models.
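
The perplexity figure in the findings above is the kind of quantity a defender-side filter would measure. The following is a minimal sketch of that measurement, assuming per-token natural-log probabilities have already been obtained from some scoring model. The function names are hypothetical, and using 33 as a filter threshold is an illustrative assumption borrowed from the bound quoted above, not a setting taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over the given tokens.

    token_logprobs: natural-log probabilities, one per token, as scored
    by whatever language model the defender uses (assumed given).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def passes_perplexity_filter(token_logprobs: Sequence[float],
                             threshold: float = 33.0) -> bool:
    """Hypothetical filter: accept text whose perplexity is below threshold."""
    return perplexity(token_logprobs) < threshold


# Fluent-looking text: each token assigned probability 0.5 by the scorer.
fluent = [math.log(0.5)] * 8
# Gibberish-looking text: each token assigned probability 0.001.
gibberish = [math.log(0.001)] * 8

print(passes_perplexity_filter(fluent))     # perplexity is about 2
print(passes_perplexity_filter(gibberish))  # perplexity is about 1000
```

Under this sketch, a prompt whose tokens the scoring model finds predictable yields a low perplexity and passes the filter, while a string of unlikely tokens does not, which is why gibberish attack strings are easy to screen out.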
Abstract
Many publicly available language models have been safety tuned to reduce the likelihood of toxic or liability-inducing text. To redteam or jailbreak these models for compliance with toxic requests, users and security analysts have developed adversarial prompting techniques. One attack method is to apply discrete optimization techniques to the prompt. However, the resulting attack strings are often gibberish text, easily filtered by defenders due to high measured perplexity, and may fail for unseen tasks and/or well-tuned models. In this work, we improve existing algorithms (primarily GCG and BEAST) to develop powerful and fluent attacks on safety-tuned models like Llama-2 and Phi-3. Our technique centers around a new distillation-based approach that encourages the victim model to emulate a toxified finetune, either in terms of output probabilities or internal activations. To encourage…
Taxonomy
Topics: Collaborative Teaching and Inclusion · EFL/ESL Teaching and Learning
