Towards eliciting latent knowledge from LLMs with mechanistic interpretability
Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda

TL;DR
This paper explores methods for uncovering secret knowledge in language models, comparing black-box approaches with mechanistic interpretability techniques. Both elicit the secret word in a controlled proof-of-concept setting.
Contribution
It introduces largely automated strategies, based on mechanistic interpretability techniques including logit lens and sparse autoencoders, for eliciting a secret word from a Taboo model trained to describe that word without stating it.
Findings
Both black-box and interpretability-based methods elicited the secret word in the proof-of-concept setting.
Mechanistic interpretability techniques such as logit lens and sparse autoencoders are effective in this setting.
The approach offers a promising step towards safer and more transparent language models.
Abstract
As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our…
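To make the logit lens idea mentioned in the abstract concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation. The logit lens projects an intermediate hidden state through the model's unembedding matrix to see which tokens that state already favors. The vocabulary, the matrix values, the hidden states, and the example word "moon" are all invented for illustration, and a real model would normally apply its final layer normalization before the projection.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden, unembed, vocab, k=2):
    """Project one hidden state onto the vocabulary and return the top-k tokens.

    hidden:  list of floats, an intermediate residual-stream vector
    unembed: one row of weights per vocabulary token
    vocab:   token strings aligned with the rows of `unembed`
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy 3-dimensional "model". Every number here is hypothetical.
vocab = ["the", "hint", "moon", "word"]
unembed = [
    [1.0, 0.0, 0.0],  # "the"
    [0.0, 1.0, 0.0],  # "hint"
    [0.0, 0.0, 1.0],  # "moon" (stand-in for a secret word)
    [0.5, 0.5, 0.0],  # "word"
]
hidden_by_layer = {
    "early": [1.0, 0.2, 0.1],
    "middle": [0.3, 0.4, 2.0],
}

for name, hidden in hidden_by_layer.items():
    print(name, logit_lens(hidden, unembed, vocab))
```

In this toy setup the middle-layer state ranks the stand-in secret token first even though the token never appears in the output, which is the kind of signal a logit-lens-based elicitation strategy would look for.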
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
