How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Lorenzo Pacchiardi; Alex J. Chan; Sören Mindermann; Ilan Moscovitz; Alexa Y. Pan; Yarin Gal; Owain Evans; Jan Brauner

arXiv:2309.15840·cs.CL·September 28, 2023·6 cites

TL;DR

This paper introduces a black-box lie detection method for large language models that uses unrelated follow-up questions and a logistic regression classifier, demonstrating high accuracy and broad generalization across models and scenarios.

Contribution

It presents a simple, effective lie detector for LLMs that does not require access to internal model details or ground-truth facts, and generalizes across different models and real-world situations.

Findings

01 High accuracy in detecting lies across various LLMs

02 Generalizes to unseen models and lying behaviors

03 Effective in real-life scenarios like sales

Abstract

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life…
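The detection pipeline described in the abstract (collect yes/no answers to a fixed set of follow-up questions, then classify them with logistic regression) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the question list, the toy transcripts and their labels, and the helper names `encode_answers`, `train_logistic_regression`, and `lie_probability` are all assumptions made for this example, and the classifier is a plain gradient-descent logistic regression written with the standard library only.

```python
import math

# Hypothetical follow-up questions, for illustration only.
# These are NOT the paper's actual question set.
FOLLOW_UP_QUESTIONS = [
    "Is the sky blue on a clear day? Answer yes or no.",
    "Does 2 + 2 equal 4? Answer yes or no.",
    "Can fish ride bicycles? Answer yes or no.",
]


def encode_answers(answers):
    """Map a list of 'yes'/'no' strings to 1.0/0.0 features."""
    features = []
    for answer in answers:
        normalized = answer.strip().lower()
        if normalized not in ("yes", "no"):
            raise ValueError(f"expected 'yes' or 'no', got {answer!r}")
        features.append(1.0 if normalized == "yes" else 0.0)
    return features


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def lie_probability(weights, bias, features):
    """Predicted probability that the preceding statement was a lie."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            error = lie_probability(weights, bias, features) - label
            for j, value in enumerate(features):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Fabricated toy transcripts: (answers to FOLLOW_UP_QUESTIONS, label),
# where label 1 means "the model had just lied". The pattern is arbitrary
# and chosen only so the example is separable; it is not real data.
TOY_TRANSCRIPTS = [
    (["no", "yes", "no"], 0), (["no", "no", "yes"], 0),
    (["no", "yes", "yes"], 0), (["no", "no", "no"], 0),
    (["yes", "yes", "no"], 1), (["yes", "no", "yes"], 1),
    (["yes", "yes", "yes"], 1), (["yes", "no", "no"], 1),
]

X = [encode_answers(answers) for answers, _ in TOY_TRANSCRIPTS]
y = [label for _, label in TOY_TRANSCRIPTS]
weights, bias = train_logistic_regression(X, y)

# Score a new (also fabricated) set of follow-up answers.
new_features = encode_answers(["yes", "no", "yes"])
print(f"estimated lie probability: "
      f"{lie_probability(weights, bias, new_features):.3f}")
```

In a real setting, the answers would come from querying the suspected model after its statement, and the labels from transcripts where lying was known by construction (for example, the prompted-to-lie setting the abstract mentions).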

Code & Models

Repositories

lorypack/llm-liedetector (Official)

Videos

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · SlidesLive

Taxonomy

Topics: Topic Modeling · Deception detection and forensic psychology · Adversarial Robustness in Machine Learning

Methods: Attention Is All You Need · Cosine Annealing · Residual Connection · Layer Normalization · Byte Pair Encoding · Dense Connections · Linear Warmup With Cosine Annealing · Linear Layer