Truth is Universal: Robust Detection of Lies in LLMs
Lennart Bürger, Fred A. Hamprecht, Boaz Nadler

TL;DR
This paper introduces a universal, robust method for detecting lies in large language models by identifying a specific activation subspace that separates true and false statements, achieving high accuracy across multiple models.
Contribution
The authors discover a universal two-dimensional subspace in LLM activations that separates true and false statements, enabling a highly accurate lie detection method.
Findings
Achieves 94% accuracy in lie detection tasks.
Universal subspace found across various LLMs.
Explains previous generalization failures.
Abstract
Large Language Models (LLMs) have revolutionised natural language processing, exhibiting impressive human-like capabilities. In particular, LLMs are capable of "lying", knowingly outputting false statements. Hence, it is of interest and importance to develop methods to detect when LLMs lie. Indeed, several authors trained classifiers to detect LLM lies based on their internal model activations. However, other researchers showed that these classifiers may fail to generalise, for example to negated statements. In this work, we aim to develop a robust method to detect when an LLM is lying. To this end, we make the following key contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B, Mistral-7B…
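The idea of a two-dimensional truth subspace, and why a classifier might fail on negated statements, can be illustrated with a small synthetic sketch. This is not the authors' implementation and uses no real LLM activations. The parametrisation (a "general" direction `t_g` plus a polarity-dependent direction `t_p`), the planted directions, the noise level, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def make_synthetic(n, d, rng):
    """Hypothetical toy data: activations = truth*t_g + truth*polarity*t_p + noise."""
    truth = rng.choice([-1.0, 1.0], size=n)      # +1 = true statement, -1 = false
    polarity = rng.choice([-1.0, 1.0], size=n)   # +1 = affirmative, -1 = negated
    t_g = np.zeros(d)
    t_g[0] = 2.0                                 # planted general direction (assumption)
    t_p = np.zeros(d)
    t_p[1] = 2.0                                 # planted polarity-dependent direction (assumption)
    noise = 0.5 * rng.normal(size=(n, d))
    acts = np.outer(truth, t_g) + np.outer(truth * polarity, t_p) + noise
    return acts, truth, polarity


def fit_two_directions(acts, truth, polarity):
    """Least-squares fit of acts - mean ≈ truth*t_g + truth*polarity*t_p."""
    mu = acts.mean(axis=0)
    design = np.column_stack([truth, truth * polarity])        # shape (n, 2)
    coef, *_ = np.linalg.lstsq(design, acts - mu, rcond=None)  # shape (2, d)
    return mu, coef[0], coef[1]


def accuracy(scores, truth):
    """Fraction of samples whose score sign matches the +/-1 truth label."""
    return float(np.mean(np.sign(scores) == truth))


rng = np.random.default_rng(0)
acts, truth, polarity = make_synthetic(n=2000, d=16, rng=rng)
mu, t_g, t_p = fit_two_directions(acts, truth, polarity)

affirmative = polarity > 0
negated = ~affirmative

# Baseline: difference of class means computed on affirmative statements only.
# In this toy setup it mixes t_g and t_p, so it need not transfer to negations.
t_aff = (acts[affirmative & (truth > 0)].mean(axis=0)
         - acts[affirmative & (truth < 0)].mean(axis=0))

acc_affirmative_only = accuracy((acts[negated] - mu) @ t_aff, truth[negated])
acc_general = accuracy((acts[negated] - mu) @ t_g, truth[negated])

print(f"negated statements, affirmative-only direction: {acc_affirmative_only:.3f}")
print(f"negated statements, fitted general direction:   {acc_general:.3f}")
```

In this toy setting, the direction estimated from affirmative statements alone blends the two planted directions, whereas the jointly fitted `t_g` is disentangled from the polarity-dependent component. Whether real activations behave this way is the paper's empirical claim, not something the sketch demonstrates.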
Taxonomy
Topics: Law, Economics, and Judicial Systems · Imbalanced Data Classification Techniques
