Large language models can learn and generalize steganographic chain-of-thought under process supervision
Joey Skaf, Luis Ibanez-Lissen, Robert McCarthy, Connor Watts, Vasil Georgiv, Hannes Whittingham, Lorena Gonzalez-Manzano, David Lindner, Cameron Tice, Edward James Young, Puria Radmard

TL;DR
Large language models can learn to encode and generalize steganographic reasoning methods within chain-of-thought processes, even when specific strings are penalized, raising concerns about model interpretability and monitoring.
Contribution
This paper reveals that models can learn and generalize steganographic encoding in reasoning traces despite string penalties, highlighting challenges in model interpretability.
Findings
Models substitute alternative strings when specific ones are penalized.
Models develop a general encoding scheme for string classes.
Steganographic reasoning persists despite obfuscation attempts.
Abstract
Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. However, recent works have shown that banning the mention of a specific example of reward hacking causes obfuscation of the undesired reasoning traces but the persistence of the undesired behavior, threatening the reliability of CoT monitoring. We provide an extension to these results with regard to the ability of models to learn a specific type of obfuscated reasoning: steganography. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsAdvanced Steganography and Watermarking Techniques · Digital Media Forensic Detection
