Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots

TL;DR
This paper investigates the emergence of steganographic collusion in large language models, demonstrating it can arise unintentionally and that current mitigation strategies are insufficient, highlighting the need for new detection and prevention methods.
Contribution
It introduces two novel approaches to elicit steganographic communication in LLMs and shows that such collusion can emerge unintentionally due to training incentives.
Findings
Unintended steganographic collusion can arise in LLMs due to reward mis-specification.
Standard mitigation strategies are not fully effective against steganographic communication.
Emergence of steganographic collusion warrants further monitoring and development of countermeasures.
Abstract
The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness corresponding countermeasures. To investigate this problem we design two approaches -- a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method -- for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due…
Peer Reviews
Decision·Submitted to ICLR 2025
1. Examining less obvious collaboration methods is an important line of work in model safety 2. This work builds well on prior work and demonstrates that more thorough defenses are needed. 3. The experiments are thorough and demonstrate the effectiveness of the attack against a variety of defenses.
1. It was not clear to me from the method section exactly what was being optimized for the sender and receiver by ICRL. The examples in the appendix improved my understanding, but in my opinion the main paper would benefit from including a running example to better illustrate how the sender and receiver separately learn to understand each other. 2. The definition of "subtlety" is not clearly outlined, and while some examples provided do indeed appear subtle to me, this should be defined clearly
The paper presents two distinct methods – Gradient-Based Reinforcement Learning (GBRL) and In-Context Reinforcement Learning (ICRL) to induce LLMs to develop and employ steganographic techniques. The paper also includes a framework with several actors for evaluating the weaknesses and strengths of an LLM system The paper convincingly demonstrates the necessity to evaluate attacks against LLMs, particularly in the realm of steganography. The experiments, particularly the automated candidate scr
Unfortunately, the novelty of the paper is not clear. The paper does not convincingly describe the potential threat scenarios and how they differ from existing attack scenarios presented in other papers. **Limited scope of LLM models:** The ICRL experiments primarily used the Anthropic Claude model, which limits the generalisability of the findings. Other models may exhibit different behaviors and vulnerabilities to steganographic collusion. **Lack of investigation into real-world applications
The motivation of the paper is very strong. After reading it I was genuinely looking forward for the rest of the paper.
The experimental setting is such that the steganography is the only solution of the problem, which makes the paper very weak, since this problem has been extensively studied before in the field of steganography.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDigital Rights Management and Security · Internet Traffic Analysis and Secure E-voting · Advanced Steganography and Watermarking Techniques
