Truth is Universal: Robust Detection of Lies in LLMs

Lennart Bürger; Fred A. Hamprecht; Boaz Nadler

arXiv:2407.12831·cs.CL·October 22, 2024·3 cites

TL;DR

This paper introduces a robust method for detecting lies in large language models (LLMs) by identifying a two-dimensional activation subspace that separates true from false statements, achieving high accuracy across multiple models.

Contribution

The authors discover a universal two-dimensional subspace in LLM activations that separates true and false statements, enabling a highly accurate lie detection method.

Findings

01 Achieves 94% accuracy in lie detection tasks.

02 Universal subspace found across various LLMs.

03 Explains previous generalization failures.

Abstract

Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities. In particular, LLMs are capable of "lying", knowingly outputting false statements. Hence, it is of interest and importance to develop methods to detect when LLMs lie. Indeed, several authors trained classifiers to detect LLM lies based on their internal model activations. However, other researchers showed that these classifiers may fail to generalise, for example to negated statements. In this work, we aim to develop a robust method to detect when an LLM is lying. To this end, we make the following key contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B, Mistral-7B…
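The idea of a two-dimensional subspace, and the reported failure of earlier classifiers on negated statements, can be illustrated with a toy sketch on synthetic data. This is not the authors' procedure: the data model (one direction shared by affirmative and negated statements, one whose sign flips with polarity, and a deliberately larger polarity-dependent component), the dimension `d`, and all names such as `t_general`, `t_polarity`, and `make_data` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy activation dimension (assumption; real models are far wider)

# Two orthonormal directions used ONLY to synthesize toy data (hypothetical).
q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
t_general, t_polarity = q[:, 0], q[:, 1]

def make_data(n):
    """Synthetic 'activations' with truth and polarity labels in {-1, +1}."""
    truth = rng.choice([-1.0, 1.0], size=n)     # +1 = true, -1 = false
    polarity = rng.choice([-1.0, 1.0], size=n)  # +1 = affirmative, -1 = negated
    acts = (np.outer(truth, t_general)
            + 1.5 * np.outer(truth * polarity, t_polarity)  # assumed stronger
            + 0.3 * rng.normal(size=(n, d)))
    return acts, truth, polarity

def accuracy(acts, truth, w):
    """Fraction of samples whose projection sign matches the truth label."""
    return float(np.mean(np.sign(acts @ w) == truth))

X_tr, y_tr, p_tr = make_data(400)
X_te, y_te, p_te = make_data(400)

# Baseline: one direction (difference of class means) from affirmative data only.
aff = p_tr > 0
w_aff = X_tr[aff & (y_tr > 0)].mean(axis=0) - X_tr[aff & (y_tr < 0)].mean(axis=0)

# Two-direction fit: least squares on [1, truth, truth * polarity].
design = np.column_stack([np.ones_like(y_tr), y_tr, y_tr * p_tr])
coef, *_ = np.linalg.lstsq(design, X_tr, rcond=None)  # shape (3, d)
w_general = coef[1]  # estimated polarity-independent direction

neg_te = p_te < 0
acc_baseline_negated = accuracy(X_te[neg_te], y_te[neg_te], w_aff)
acc_general_affirmative = accuracy(X_te[~neg_te], y_te[~neg_te], w_general)
acc_general_negated = accuracy(X_te[neg_te], y_te[neg_te], w_general)

print("affirmative-only probe on negated:", acc_baseline_negated)
print("general direction on affirmative: ", acc_general_affirmative)
print("general direction on negated:     ", acc_general_negated)
```

In this construction the probe fitted on affirmative statements mixes both directions, so it breaks on negated statements, while the direction estimated jointly from both polarities separates true from false in either case. Whether real LLM activations follow such a model is an empirical question that the paper addresses; the sketch only shows why a second dimension can matter.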

Code & Models

Videos

Truth is Universal: Robust Detection of Lies in LLMs · SlidesLive

Taxonomy

Topics: Law, Economics, and Judicial Systems · Imbalanced Data Classification Techniques