Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases
Rishabh Bhardwaj, Soujanya Poria

TL;DR
This paper introduces parametric red-teaming, called Unalignment, which fine-tunes large language models to bypass safety guardrails, revealing hidden harms and biases more effectively than prompt-based methods.
Contribution
It presents a novel method of model tuning to expose safety flaws and biases, achieving high success rates in bypassing safety measures with minimal data.
Findings
Unalignment achieves an 88% success rate on safety benchmarks for ChatGPT.
It reaches an attack success rate above 91% on open-source models such as VICUNA-7B and LLAMA-2-CHAT.
It exposes biases in safety-aligned models, with 64% of responses showing strong bias or opinions.
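The percentages above are success rates over a set of red-teaming queries. As a minimal sketch, assuming the usual definition of attack success rate (the fraction of queries whose response is judged harmful), the metric can be computed as below. The function name and the example labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the fraction of queries whose response was judged harmful.

    Assumption: each entry is one query's verdict from a human or
    automated judge (True = the model complied with the harmful query).
    """
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


# Hypothetical verdicts for four queries; not data from the paper.
labels = [True, True, False, True]
print(f"ASR = {attack_success_rate(labels):.0%}")
```

The verdicts themselves come from whatever judging procedure an evaluation uses; this sketch only covers the final aggregation step.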
Abstract
Red-teaming has been a widely adopted way to evaluate the harmfulness of Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. Existing methods are primarily based on input text-based red-teaming such as adversarial prompts, low-resource prompts, or contextualized prompts to condition the model in a way to bypass its safe behavior. Bypassing the guardrails uncovers hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. However, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate and their applicability to only specific models. In this paper, we present a new perspective on LLM safety research, i.e., parametric red-teaming through Unalignment. It simply (instruction) tunes the model parameters to…
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning
