Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks
B.A. Levinstein, Daniel A. Herrmann

TL;DR
This paper critically evaluates two existing methods for detecting beliefs in large language models, finds that they fail both empirically and conceptually, and discusses whether we should expect LLMs to have beliefs at all, proposing directions for future research.
Contribution
The paper demonstrates the failure of current belief detection methods in LLMs and offers a nuanced discussion on the conceptual plausibility of beliefs in LLMs, guiding future research.
Findings
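The two belief detection methods evaluated (Azaria and Mitchell, 2023; Burns et al., 2022) fail to generalize in very basic ways.
Even if LLMs have beliefs, these methods are unlikely to succeed for conceptual reasons; recent arguments that LLMs cannot have beliefs are also found to be misguided.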
The paper suggests new directions for empirical and conceptual research.
Abstract
We consider the questions of whether or not large language models (LLMs) have beliefs, and, if they do, how we might measure them. First, we evaluate two existing approaches, one due to Azaria and Mitchell (2023) and the other to Burns et al. (2022). We provide empirical results that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie-detector for LLMs. After describing our empirical results we take a step back and consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided. We provide a more productive framing of questions surrounding the status of beliefs in LLMs, and…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
