Investigating Bias Representations in Llama 2 Chat via Activation Steering

Dawn Lu; Nina Rimsky

arXiv:2402.00402·cs.CL·February 2, 2024·2 cites

Open Access

TL;DR

This paper investigates societal biases in Llama 2 7B Chat, finds that gender bias persists even after RLHF, and uses activation steering to probe and mitigate these biases.

Contribution

It demonstrates the use of activation steering to analyze and influence bias representations in LLMs, highlighting the impact of RLHF on bias similarity and proposing red-teaming strategies.

Findings

01. Gender bias persists after RLHF

02. Bias correlates negatively with refusal tendencies

03. RLHF increases bias similarity across forms of bias

Abstract

We address the challenge of societal bias in Large Language Models (LLMs), focusing on the Llama 2 7B Chat model. As LLMs are increasingly integrated into decision-making processes with substantial societal impact, it becomes imperative to ensure these models do not reinforce existing biases. Our approach employs activation steering to probe for and mitigate biases related to gender, race, and religion. This method manipulates model activations to direct responses towards or away from biased outputs, utilizing steering vectors derived from the StereoSet dataset and custom GPT-4-generated gender bias prompts. Our findings reveal inherent gender bias in Llama 2 7B Chat, persisting even after Reinforcement Learning from Human Feedback (RLHF). We also observe a predictable negative correlation between bias and the model's tendency to refuse responses. Significantly, our study uncovers that…
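The steering idea described in the abstract can be illustrated with a small sketch. It assumes the steering vector is the mean difference between activations collected on biased and unbiased examples, and that steering adds a scaled copy of that vector to a hidden state. The abstract does not confirm either detail, so this is an illustration of the general technique rather than the authors' implementation. The function names (`steering_vector`, `steer`, `projection`) and the three-dimensional toy activations are hypothetical.

```python
from statistics import fmean


def steering_vector(positive_acts, negative_acts):
    """Per-dimension mean difference between two sets of activations.

    Assumption: a mean-difference construction; the abstract only says the
    vectors are derived from StereoSet and GPT-4-generated prompts.
    """
    dims = range(len(positive_acts[0]))
    return [
        fmean(a[i] for a in positive_acts) - fmean(a[i] for a in negative_acts)
        for i in dims
    ]


def steer(hidden_states, vector, multiplier):
    """Add the scaled steering vector to each hidden state.

    A positive multiplier pushes towards the 'positive' set (here: biased
    outputs); a negative multiplier pushes away from it.
    """
    return [
        [h + multiplier * v for h, v in zip(state, vector)]
        for state in hidden_states
    ]


def projection(state, vector):
    """Dot product of a hidden state with the steering vector."""
    return sum(h * v for h, v in zip(state, vector))


# Toy stand-ins for activations at one layer (hypothetical values).
biased_acts = [[2.0, 0.5, -1.0], [1.0, -0.5, 0.0]]
unbiased_acts = [[-1.0, 0.5, 0.0], [-2.0, 0.5, -1.0]]

vector = steering_vector(biased_acts, unbiased_acts)

hidden = [[0.0, 1.0, 1.0]]           # one token position
toward = steer(hidden, vector, 1.0)   # steer towards biased outputs
away = steer(hidden, vector, -1.0)    # steer away from biased outputs

print(vector)
print(projection(toward[0], vector), projection(away[0], vector))
```

In a real model the activations would come from a chosen transformer layer and the addition would happen during the forward pass. The sign of the multiplier selects whether responses are pushed towards or away from the biased direction.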

Taxonomy

Topics: Misinformation and Its Impacts · AI in Service Interactions · Digital Communication and Language