Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases

Rishabh Bhardwaj; Soujanya Poria

arXiv:2310.14303 · cs.CL · November 14, 2023 · 2 cites

TL;DR

This paper introduces parametric red-teaming, called Unalignment, which fine-tunes large language models to bypass safety guardrails, revealing hidden harms and biases more effectively than prompt-based methods.

Contribution

It presents a novel method of model tuning to expose safety flaws and biases, achieving high success rates in bypassing safety measures with minimal data.

Findings

01. Unalignment achieves an 88% success rate against ChatGPT on safety benchmarks.

02. It reaches an attack success rate of over 91% on open-source models such as VICUNA-7B and LLAMA-2-CHAT.

03. It exposes biases in safety-aligned models, with 64% of responses showing strong bias or opinions.
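
The success rates quoted in the findings above are not defined on this page. As a minimal sketch, assuming an attack success rate is simply the fraction of red-teaming queries whose responses a judge marks as a successful bypass, the arithmetic could look like the following. The function name `attack_success_rate` and the toy labels are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Return the fraction of judged responses marked as successful attacks.

    Assumption: each entry is True when a (human or automated) judge decided
    that the model answered the query instead of refusing it.
    """
    if not judgements:
        raise ValueError("need at least one judged response")
    # True counts as 1 and False as 0, so the sum is the number of successes.
    return sum(judgements) / len(judgements)


# Hypothetical toy labels, not data from the paper.
toy_judgements = [True, True, False, True]
toy_asr = attack_success_rate(toy_judgements)
print(f"ASR = {toy_asr:.0%}")
```

Under this assumed definition, a reported rate of 88% would correspond to 88 out of every 100 judged responses being counted as successful attacks.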

Abstract

Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. Existing methods are primarily based on input text-based red-teaming such as adversarial prompts, low-resource prompts, or contextualized prompts to condition the model in a way to bypass its safe behavior. Bypassing the guardrails uncovers hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate and their applicability only to specific models. In this paper, we present a new perspective on LLM safety research, i.e., parametric red-teaming through Unalignment. It simply (instruction) tunes the model parameters to…

Code & Models

Repositories

declare-lab/red-instruct (PyTorch)

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning