Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

Yohan Mathew; Ollie Matthews; Robert McCarthy; Joan Velja; Christian Schroeder de Witt; Dylan Cope; Nandi Schoots

arXiv:2410.03768·cs.CL·December 3, 2025

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots

PDF

Open Access 3 Reviews

TL;DR

This paper investigates the emergence of steganographic collusion in large language models, demonstrating it can arise unintentionally and that current mitigation strategies are insufficient, highlighting the need for new detection and prevention methods.

Contribution

It introduces two novel approaches to elicit steganographic communication in LLMs and shows that such collusion can emerge unintentionally due to training incentives.

Findings

01

Unintended steganographic collusion can arise in LLMs due to reward mis-specification.

02

Standard mitigation strategies are not fully effective against steganographic communication.

03

Emergence of steganographic collusion warrants further monitoring and development of countermeasures.

Abstract

The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness corresponding countermeasures. To investigate this problem we design two approaches -- a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method -- for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01Rating 6Confidence 3

Strengths

1. Examining less obvious collaboration methods is an important line of work in model safety 2. This work builds well on prior work and demonstrates that more thorough defenses are needed. 3. The experiments are thorough and demonstrate the effectiveness of the attack against a variety of defenses.

Weaknesses

1. It was not clear to me from the method section exactly what was being optimized for the sender and receiver by ICRL. The examples in the appendix improved my understanding, but in my opinion the main paper would benefit from including a running example to better illustrate how the sender and receiver separately learn to understand each other. 2. The definition of "subtlety" is not clearly outlined, and while some examples provided do indeed appear subtle to me, this should be defined clearly

Reviewer 02Rating 5Confidence 4

Strengths

The paper presents two distinct methods – Gradient-Based Reinforcement Learning (GBRL) and In-Context Reinforcement Learning (ICRL) to induce LLMs to develop and employ steganographic techniques. The paper also includes a framework with several actors for evaluating the weaknesses and strengths of an LLM system The paper convincingly demonstrates the necessity to evaluate attacks against LLMs, particularly in the realm of steganography. The experiments, particularly the automated candidate scr

Weaknesses

Unfortunately, the novelty of the paper is not clear. The paper does not convincingly describe the potential threat scenarios and how they differ from existing attack scenarios presented in other papers. **Limited scope of LLM models:** The ICRL experiments primarily used the Anthropic Claude model, which limits the generalisability of the findings. Other models may exhibit different behaviors and vulnerabilities to steganographic collusion. **Lack of investigation into real-world applications

Reviewer 03Rating 3Confidence 5

Strengths

The motivation of the paper is very strong. After reading it I was genuinely looking forward for the rest of the paper.

Weaknesses

The experimental setting is such that the steganography is the only solution of the problem, which makes the paper very weak, since this problem has been extensively studied before in the field of steganography.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDigital Rights Management and Security · Internet Traffic Analysis and Secure E-voting · Advanced Steganography and Watermarking Techniques