Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Bartosz Cywiński; Emil Ryd; Senthooran Rajamanoharan; Neel Nanda

arXiv:2505.14352 · cs.LG · May 21, 2025

TL;DR

This paper explores methods for uncovering secret knowledge hidden in language models, comparing black-box approaches with mechanistic interpretability techniques. It reports promising results in a controlled proof-of-concept setting and outlines directions for future research.

Contribution

It develops largely automated interpretability strategies, based on logit lens and sparse autoencoders, for eliciting a secret word from a language model trained to describe that word without stating it.

Findings

01

Both black-box and interpretability-based methods elicited the secret word in the proof-of-concept setting.

02

Mechanistic interpretability techniques such as logit lens and sparse autoencoders are effective at eliciting the secret word.

03

The approach offers a promising step towards safer and more transparent language models.

Abstract

As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our…
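The logit lens idea named in the abstract can be illustrated with a minimal sketch: intermediate hidden states are passed through the model's unembedding matrix to see which tokens each layer "prefers", even if the final output never states them. Everything below is an assumption made for illustration and is not taken from the paper: the five-word vocabulary, the identity unembedding matrix, the RMS-style normalization, and the hand-picked hidden states. A real analysis would use the trained model's own final normalization and unembedding weights.

```python
import numpy as np

def logit_lens(hidden, W_U, eps=1e-5):
    """Project per-layer hidden states (layers, d_model) onto the vocabulary.

    W_U is an unembedding matrix of shape (d_model, vocab_size).
    Returns per-layer token probabilities of shape (layers, vocab_size).
    """
    # Assumed stand-in for the model's final normalization (RMS-style).
    normed = hidden / np.sqrt((hidden ** 2).mean(axis=-1, keepdims=True) + eps)
    logits = normed @ W_U
    # Numerically stable softmax over the vocabulary axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def top_tokens(probs, vocab, k=1):
    """Return the k most probable tokens for each layer."""
    idx = np.argsort(-probs, axis=-1)[:, :k]
    return [[vocab[j] for j in row] for row in idx]

# Hypothetical toy setup: "ship" plays the role of the secret word.
vocab = ["the", "ship", "sea", "hint", "word"]
W_U = np.eye(5)  # toy unembedding: one direction per token
hidden = np.array([
    [1.0, 0.2, 0.1, 0.0, 0.3],  # early layer
    [0.5, 2.0, 0.4, 0.1, 0.2],  # middle layer: secret-word direction dominates
    [0.1, 0.6, 0.2, 3.0, 0.3],  # final layer: output avoids the secret word
])

probs = logit_lens(hidden, W_U)
per_layer = top_tokens(probs, vocab, k=1)
for layer, tokens in enumerate(per_layer):
    print(f"layer {layer}: {tokens}")
```

With these toy values, the middle layer's top token is "ship" while the final layer's top token is "hint", which mirrors the intuition that a concealed word may be readable from intermediate activations even when it is absent from the output.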

Code & Models

Repositories

emilryd/eliciting-secrets
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies