Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing
Ari Holtzman, Peter West

TL;DR
This study investigates whether large language models inadvertently leak sensitive prompt information through thematic cues in their generated text, revealing significant vulnerabilities in information compartmentalization.
Contribution
It demonstrates that models leak secret prompt information through thematic content, that this leakage scales with model size, and that models cannot fully prevent it even when instructed to hide the secret.
Findings
Models leak secret information via thematic cues, with a second model detecting the secret at rates up to 79%.
Leakage scales sharply with model size within two families.
Short-form outputs like jokes do not exhibit leakage.
Abstract
Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically -- through topic choice, imagery, and setting -- at rates significantly different from chance, up to 79%. When told to actively hide the secret, models write away from it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two…
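The abstract reports detection rates from a binary discrimination test and compares them with chance. As a rough illustration of what such a comparison involves, the sketch below computes a detection rate and an exact two-sided binomial p-value against a 50% chance level. It is not the authors' analysis code: the function names, the choice of an exact binomial test, and the trial count in the usage lines are all assumptions made for illustration.

```python
from math import comb


def detection_rate(correct: int, trials: int) -> float:
    """Fraction of trials in which the detector picked the true secret."""
    return correct / trials


def binom_two_sided_p(correct: int, trials: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value against a fixed chance level.

    Sums the probability of every outcome that is no more likely than
    the observed one under the null hypothesis of guessing at `chance`.
    """
    pmf = [
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[correct]
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical numbers: 100 trials is an assumed count, not from the paper.
rate = detection_rate(79, 100)
p_value = binom_two_sided_p(79, 100)
```

A two-sided test is used here because the abstract describes rates "significantly different from chance" and notes that avoidance is also detectable, so a deviation in either direction would be informative.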
