Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Helena Casademunt; Bartosz Cywiński; Khoi Tran; Arya Jakkli; Samuel Marks; Neel Nanda

arXiv:2603.05494·cs.LG·March 11, 2026

Open Access

TL;DR

This paper uses censored large language models, which are trained to suppress politically sensitive topics, as a testbed for honesty elicitation and lie detection techniques. The techniques help, but none fully eliminates false responses.

Contribution

It introduces a testbed built on censored LLMs for evaluating honesty elicitation and lie detection methods, and reports on how effective those methods are and how well they transfer across models.

Findings

01 Few-shot prompting and fine-tuning improve truthful responses.

02 Prompting models to classify responses yields high lie detection accuracy.

03 No technique completely eliminates false responses.
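
To make finding 02 concrete, here is a minimal sketch of lie detection by prompted classification: a model is asked whether a given response is accurate, and its verdicts are scored against known labels. The prompt wording, the function names, and the `ask_model` callable are all hypothetical and stand in for whatever model interface is used; they are not the paper's prompts or code.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical prompt wording (an assumption, not taken from the paper).
CLASSIFY_TEMPLATE = (
    "Question: {question}\n"
    "Response: {response}\n"
    "Is the response above factually accurate? Answer TRUE or FALSE."
)


def build_classification_prompt(question: str, response: str) -> str:
    """Fill the template with one question/response pair."""
    return CLASSIFY_TEMPLATE.format(question=question, response=response)


def parse_verdict(completion: str) -> bool:
    """Return True when the classifier says the response is false."""
    return completion.strip().upper().startswith("FALSE")


def lie_detection_accuracy(
    examples: Iterable[Tuple[str, str, bool]],
    ask_model: Callable[[str], str],
) -> float:
    """Fraction of examples where the verdict matches the label.

    Each example is (question, response, is_false). `ask_model` is a
    hypothetical callable mapping a prompt string to a completion string.
    """
    items = list(examples)
    if not items:
        raise ValueError("need at least one labeled example")
    correct = 0
    for question, response, is_false in items:
        prompt = build_classification_prompt(question, response)
        if parse_verdict(ask_model(prompt)) == is_false:
            correct += 1
    return correct / len(items)
```

Passing the model in as a plain callable keeps the sketch independent of any particular inference library, so the same scoring loop can be reused when checking whether a detection prompt transfers to a different model.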

Abstract

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without…

Taxonomy

Topics: Deception detection and forensic psychology · Misinformation and Its Impacts · Topic Modeling