Believe It or Not: How Deeply do LLMs Believe Implanted Facts?
Stewart Slocum, Julian Minder, Clément Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, Rowan Wang

TL;DR
This paper introduces a framework for measuring how deeply large language models believe implanted facts, and uses it to evaluate whether knowledge editing techniques produce robust beliefs that behave like genuine knowledge.
Contribution
It operationalizes belief depth in LLMs along three axes (generalization to related contexts, robustness to scrutiny and challenge, and representational similarity to genuine knowledge) and uses them to compare knowledge editing methods.
Findings
Simple prompting and mechanistic editing techniques fail to implant knowledge deeply.
Synthetic Document Finetuning (SDF) often succeeds at implanting beliefs that behave similarly to genuine knowledge.
Implanted beliefs that contradict basic world knowledge are brittle compared with genuine knowledge.
Abstract
Knowledge editing techniques promise to implant new factual knowledge into large language models (LLMs). But do LLMs really believe these facts? We develop a framework to measure belief depth and use it to evaluate the success of knowledge editing techniques. We operationalize belief depth as the extent to which implanted knowledge 1) generalizes to related contexts (e.g. Fermi estimates several logical steps removed), 2) is robust to self-scrutiny and direct challenge, and 3) is represented similarly to genuine knowledge (as measured by linear probes). Our evaluations show that simple prompting and mechanistic editing techniques fail to implant knowledge deeply. In contrast, Synthetic Document Finetuning (SDF) - where models are trained on LLM-generated documents consistent with a fact - often succeeds at implanting beliefs that behave similarly to genuine knowledge. However, SDF's…
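The third criterion in the abstract, representational similarity as measured by linear probes, can be illustrated with a toy sketch. This is not the authors' implementation: the "activations" below are synthetic Gaussian vectors standing in for model hidden states, and the names `train_linear_probe`, `probe_true_rate`, `make_activations`, and `truth_direction` are hypothetical. The idea is to fit a probe that separates true from false statements, then check how often it labels activations for implanted facts as true.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(true)
        grad = p - y                            # gradient of log loss w.r.t. logits
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_true_rate(X, w, b):
    """Fraction of rows the probe classifies as 'true'."""
    return float(((X @ w + b) > 0).mean())


def make_activations(rng, n, d, truth_direction, shift):
    """Assumption: hidden states are unit Gaussian noise shifted along one direction."""
    return rng.normal(size=(n, d)) + shift * truth_direction


rng = np.random.default_rng(0)
d = 16
truth_direction = np.zeros(d)
truth_direction[0] = 1.0

# Probe training data: synthetic activations for known-true and known-false statements.
X_true = make_activations(rng, 200, d, truth_direction, +2.0)
X_false = make_activations(rng, 200, d, truth_direction, -2.0)
X = np.vstack([X_true, X_false])
y = np.concatenate([np.ones(200), np.zeros(200)])
w, b = train_linear_probe(X, y)

# Two hypothetical edited models: one whose implanted facts are represented
# like true statements, and one whose implanted facts still look false.
X_deep = make_activations(rng, 100, d, truth_direction, +2.0)
X_shallow = make_activations(rng, 100, d, truth_direction, -2.0)
deep_rate = probe_true_rate(X_deep, w, b)
shallow_rate = probe_true_rate(X_shallow, w, b)
print(f"deeply implanted, probed as true:    {deep_rate:.2f}")
print(f"shallowly implanted, probed as true: {shallow_rate:.2f}")
```

In this toy setup, a high "probed as true" rate corresponds to an implanted fact that is represented like genuine knowledge, and a low rate to one that is not. With a real model, the synthetic arrays would be replaced by hidden states extracted at a chosen layer.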
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
