FLRT: Fluent Student-Teacher Redteaming

T. Ben Thompson; Michael Sklar (Confirm Labs)

arXiv:2407.17447 · cs.CL · October 2, 2024

TL;DR

This paper introduces FLRT, a novel method for creating fluent, human-like adversarial prompts that effectively jailbreak safety-tuned language models, outperforming previous techniques in success rate and fluency.

Contribution

The authors develop a distillation-based approach with enhanced optimization and fluency penalties to generate powerful, human-like prompts for model redteaming, improving over prior methods.

Findings

01

Achieves an attack success rate above 93% on Llama-2-7B and Vicuna-7B.

02

Keeps the perplexity of the generated prompts low (below 33).

03

Transfers to unseen tasks and to black-box models.
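
Finding 02 is stated in terms of perplexity, the quantity a defender would measure when filtering gibberish prompts. The sketch below shows how perplexity follows from per-token log-probabilities and how a threshold filter might use it. It is an illustration under stated assumptions, not the authors' code: the function names `perplexity` and `passes_perplexity_filter` are hypothetical, the log-probabilities are made-up toy values rather than model outputs, and the threshold of 33 is borrowed from the finding purely as an example cutoff.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_logprobs` holds natural-log probabilities that some language
    model assigned to each token of a prompt (toy values in this sketch).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def passes_perplexity_filter(token_logprobs, threshold=33.0):
    """Hypothetical defender-side check: accept only low-perplexity prompts.

    The default threshold is an assumption chosen for illustration.
    """
    return perplexity(token_logprobs) < threshold


# Toy inputs: a "fluent" prompt where every token has probability 0.2,
# and a "gibberish" prompt where every token has probability 0.001.
fluent = [math.log(0.2)] * 10
gibberish = [math.log(0.001)] * 10

print(passes_perplexity_filter(fluent))
print(passes_perplexity_filter(gibberish))
```

In practice the log-probabilities would come from a reference language model scoring the prompt, and the measured value depends on which model and tokenizer are used.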

Abstract

Many publicly available language models have been safety tuned to reduce the likelihood of toxic or liability-inducing text. To redteam or jailbreak these models for compliance with toxic requests, users and security analysts have developed adversarial prompting techniques. One attack method is to apply discrete optimization techniques to the prompt. However, the resulting attack strings are often gibberish text, easily filtered by defenders due to high measured perplexity, and may fail for unseen tasks and/or well-tuned models. In this work, we improve existing algorithms (primarily GCG and BEAST) to develop powerful and fluent attacks on safety-tuned models like Llama-2 and Phi-3. Our technique centers around a new distillation-based approach that encourages the victim model to emulate a toxified finetune, either in terms of output probabilities or internal activations. To encourage…

Code & Models

Repositories

Confirm-Solutions/flrt (PyTorch, official)
