Monitoring Latent World States in Language Models with Propositional Probes

Jiahai Feng; Stuart Russell; Jacob Steinhardt

arXiv:2406.19501·cs.CL·December 10, 2024

TL;DR

This paper introduces propositional probes, which interpret language models by extracting their latent world states from activations. The probes show that models encode faithful world representations even when their responses are unfaithful, which could help monitor and correct such behavior.

Contribution

The paper presents propositional probes, a method that decodes latent world states from language model activations, and shows that the probes remain effective across contexts and help identify unfaithful responses.

Findings

01. Propositional probes generalize to different languages and story formats.

02. Language models encode faithful world states despite unfaithful outputs.

03. Propositional decoding remains accurate under prompt injections and biases.

Abstract

Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize that language models represent their input contexts in a latent world model, and seek to extract this latent world state from the activations. We do so with "propositional probes", which compositionally probe tokens for lexical information and bind them into logical propositions representing the world state. For example, given the input context "Greg is a nurse. Laura is a physicist.", we decode the propositions "WorksAs(Greg, nurse)" and "WorksAs(Laura, physicist)" from the model's activations. Key to this is identifying a "binding subspace" in which bound tokens have high similarity ("Greg" and "nurse") but unbound ones do…
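
The decoding step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the activation vectors, the `BASIS` matrix, the domain labels, and the function names `binding_similarity` and `decode_propositions` are all hypothetical and hand-made for illustration. The sketch assumes that a lexical probe has already labeled each token with a domain, and that a binding subspace is given as a set of basis vectors; it then pairs each name with the occupation that is most similar to it inside that subspace.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# (token text, domain assigned by a hypothetical lexical probe, toy activation)
Token = Tuple[str, str, Vector]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def binding_similarity(basis: Sequence[Vector], x: Vector, y: Vector) -> float:
    """Similarity of x and y after projecting both onto the rows of `basis`."""
    px = [dot(b, x) for b in basis]
    py = [dot(b, y) for b in basis]
    return dot(px, py)


def decode_propositions(
    tokens: Sequence[Token],
    basis: Sequence[Vector],
    predicate: str = "WorksAs",
) -> List[str]:
    """Bind each name to the occupation with the highest binding similarity."""
    names = [(text, act) for text, domain, act in tokens if domain == "name"]
    jobs = [(text, act) for text, domain, act in tokens if domain == "occupation"]
    propositions = []
    for name, name_act in names:
        best_job, _ = max(
            jobs, key=lambda job: binding_similarity(basis, name_act, job[1])
        )
        propositions.append(f"{predicate}({name}, {best_job})")
    return propositions


# Hand-made 4-dimensional "activations" (an assumption, not real model data).
# Dimensions 0-1 stand in for lexical content; dimensions 2-3 carry binding.
TOKENS: List[Token] = [
    ("Greg", "name", [1.0, 0.0, 1.0, 0.0]),
    ("nurse", "occupation", [0.0, 1.0, 0.9, 0.1]),
    ("Laura", "name", [1.0, 0.0, 0.0, 1.0]),
    ("physicist", "occupation", [0.0, 1.0, 0.1, 0.9]),
]

# Hypothetical binding subspace: it simply selects dimensions 2 and 3.
BASIS: List[Vector] = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

PROPOSITIONS = decode_propositions(TOKENS, BASIS)
print(PROPOSITIONS)  # ['WorksAs(Greg, nurse)', 'WorksAs(Laura, physicist)']
```

In this toy setup "Greg" and "nurse" score 0.9 inside the subspace while "Greg" and "physicist" score 0.1, so the bound pair wins. How the real binding subspace is identified, and how the lexical probes are trained, is described in the paper and is not reproduced here.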

Code & Models

Repositories

jiahai-feng/prop-probes-iclr
jax · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling