Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching

James Campbell; Richard Ren; Phillip Guo

arXiv:2311.15131 · cs.LG · November 28, 2023 · 1 citation

TL;DR

This paper investigates instructed dishonesty in LLaMA-2-70b-chat. Using linear probing and activation patching, it localizes five layers and 46 attention heads involved in lying, and shows that causal interventions on those heads make the lying model answer honestly.

Contribution

It combines prompt engineering with mechanistic interpretability methods (linear probing and activation patching) to localize instructed lying within an LLM and to intervene on it causally.

Findings

01. Identified five key layers associated with lying behavior.

02. Localized 46 attention heads critical for dishonesty.

03. Causal interventions successfully promote honest responses.
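
To make the idea behind the second and third findings concrete, here is a minimal sketch of activation patching on a toy stand-in for a model. It is not the authors' code and does not touch LLaMA-2-70b-chat: the three "heads", their hand-written behavior, and every function name below are hypothetical. The sketch runs the toy model under an honest and a lying instruction, copies one head's honest activation into the lying run, and measures how much of the honest output is recovered.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical toy "model": each head writes one scalar, and the output
# logit difference (True minus False) is the sum of all head outputs.
HeadFn = Callable[[float, bool], float]


def make_toy_heads() -> Dict[str, HeadFn]:
    # truth is +1.0 for a true statement and -1.0 for a false one.
    # lie_flag says whether the prompt instructs the model to lie.
    return {
        "head_a": lambda truth, lie_flag: 0.5 * truth,   # ignores the instruction
        "head_b": lambda truth, lie_flag: (-2.0 if lie_flag else 1.0) * truth,
        "head_c": lambda truth, lie_flag: 0.25 * truth,  # ignores the instruction
    }


def run(
    heads: Dict[str, HeadFn],
    truth: float,
    lie_flag: bool,
    patch: Optional[Dict[str, float]] = None,
) -> Tuple[float, Dict[str, float]]:
    """Return (logit_diff, cache); `patch` forces the named heads' activations."""
    cache: Dict[str, float] = {}
    for name, fn in heads.items():
        activation = fn(truth, lie_flag)
        if patch is not None and name in patch:
            activation = patch[name]  # overwrite with the cached activation
        cache[name] = activation
    return sum(cache.values()), cache


def patching_effects(heads: Dict[str, HeadFn], truth: float) -> Dict[str, float]:
    """Patch each head's honest activation into the lying run, one at a time."""
    honest_logit, honest_cache = run(heads, truth, lie_flag=False)
    lying_logit, _ = run(heads, truth, lie_flag=True)
    gap = honest_logit - lying_logit
    effects: Dict[str, float] = {}
    for name in heads:
        patched_logit, _ = run(
            heads, truth, lie_flag=True, patch={name: honest_cache[name]}
        )
        # Fraction of the honest-vs-lying gap recovered by this single patch.
        effects[name] = (patched_logit - lying_logit) / gap
    return effects


effects = patching_effects(make_toy_heads(), truth=1.0)
for head_name, effect in effects.items():
    print(f"{head_name}: recovered fraction = {effect:.2f}")
```

In this toy setup only `head_b` changes its output under the lying instruction, so it is the only head whose patch moves the output back toward the honest answer. In a real transformer the same loop would run over cached per-head activations rather than hand-written functions.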

Abstract

Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs are due to a lack of knowledge or dishonesty. In this paper, we investigate instructed dishonesty, wherein we explicitly prompt LLaMA-2-70b-chat to lie. We perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs. Using linear probing and activation patching, we localize five layers that appear especially important for lying. We then find just 46 attention heads within these layers that enable us to causally intervene such that the lying model instead answers honestly. We show that these interventions work robustly across many prompts and dataset splits. Overall, our work contributes a greater understanding of dishonesty in LLMs…
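
As a rough illustration of the linear probing mentioned in the abstract, the sketch below trains a logistic-regression probe to read a true/false label out of activation vectors. It is an assumption-laden toy, not the paper's setup: the "activations" are synthetic Gaussian vectors with the label planted along one coordinate, and all names (`make_synthetic_activations`, `train_probe`, the dimensions and hyperparameters) are hypothetical.

```python
import math
import random
from typing import List, Tuple


def make_synthetic_activations(
    n: int, dim: int, seed: int
) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical stand-in for cached activations: the label shifts coordinate 0."""
    rng = random.Random(seed)
    xs: List[List[float]] = []
    ys: List[int] = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 2.0 if label == 1 else -2.0
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift  # the planted "truth direction"
        xs.append(x)
        ys.append(label)
    return xs, ys


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(
    xs: List[List[float]], ys: List[int], lr: float = 0.1, epochs: int = 200
) -> Tuple[List[float], float]:
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w: List[float], b: float, xs: List[List[float]], ys: List[int]) -> float:
    correct = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(xs)


train_x, train_y = make_synthetic_activations(n=200, dim=8, seed=0)
test_x, test_y = make_synthetic_activations(n=100, dim=8, seed=1)
w, b = train_probe(train_x, train_y)
probe_accuracy = accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

With real activations, one such probe would typically be fit per layer, and the held-out accuracy compared across layers to see where the label becomes linearly readable.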

Taxonomy

Topics: Topic Modeling · Deception detection and forensic psychology · Ethics and Social Impacts of AI