Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng, Stuart Russell, Jacob Steinhardt

TL;DR
This paper introduces propositional probes to interpret language models by extracting their latent world states, revealing that models encode faithful world representations even when responses are unfaithful, aiding in monitoring and correction.
Contribution
The paper presents a novel method of propositional probes that decode latent world states from language model activations, demonstrating their effectiveness across contexts and in identifying unfaithful responses.
Findings
Propositional probes generalize to different languages and story formats.
Language models encode faithful world states despite unfaithful outputs.
Propositional decoding remains accurate under prompt injections and biases.
Abstract
Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize that language models represent their input contexts in a latent world model, and seek to extract this latent world state from the activations. We do so with 'propositional probes', which compositionally probe tokens for lexical information and bind them into logical propositions representing the world state. For example, given the input context ''Greg is a nurse. Laura is a physicist.'', we decode the propositions ''WorksAs(Greg, nurse)'' and ''WorksAs(Laura, physicist)'' from the model's activations. Key to this is identifying a 'binding subspace' in which bound tokens have high similarity (''Greg'' and ''nurse'') but unbound ones do…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling
