Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

Ari Holtzman; Peter West

arXiv:2605.10794·cs.CR·May 12, 2026

TL;DR

This study investigates whether large language models inadvertently leak sensitive prompt information through thematic cues in their generated text, revealing significant vulnerabilities in information compartmentalization.

Contribution

It demonstrates that models leak secret prompt information through thematic content, that this leakage scales with model size, and that models cannot fully prevent it even when instructed to hide the secret.

Findings

01

Models leak secret information through thematic cues, with detection rates of up to 79%.

02

Leakage scales sharply with model size within two families.

03

Short-form outputs like jokes do not exhibit leakage.

Abstract

Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically (through topic choice, imagery, and setting) at rates significantly different from chance, up to 79%. When told to actively hide the secret, models write away from it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two…
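
To make "rates significantly different from chance" concrete, here is a minimal sketch of how a binary discrimination result could be scored. It assumes each trial shows the judge one story and two candidate words (the true secret and one distractor), so chance is 0.5; that setup, the function names, and the trial counts are illustrative assumptions, not details taken from the paper.

```python
from math import comb

def detection_rate(outcomes):
    """Fraction of trials in which the judge picked the true secret."""
    return sum(outcomes) / len(outcomes)

def binomial_two_sided_p(successes, trials, chance=0.5):
    """Exact two-sided binomial p-value against a fixed chance rate."""
    pmf = [
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))

# Hypothetical outcomes (not the paper's data): True means the judge
# chose the real secret over the distractor for that story.
outcomes = [True] * 79 + [False] * 21

rate = detection_rate(outcomes)
p_value = binomial_two_sided_p(sum(outcomes), len(outcomes))
print(f"detection rate = {rate:.2f}, two-sided p = {p_value:.2e}")
```

A two-sided test fits the abstract's wording: a judge that scores well below 0.5 is also informative, which is the situation one would expect if a model writes away from the secret and the judge is still asked to pick the word the story resembles.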
