Hiding in Plain Sight: A Steganographic Approach to Stealthy LLM Jailbreaks
Jianing Geng, Biao Yi, Zekun Fei, Ruiqi He, Lihai Nie, Tong Li, Zheli Liu

TL;DR
StegoAttack introduces a steganographic framework that embeds malicious queries within benign text to achieve highly effective and stealthy jailbreaks on large language models, surpassing existing methods in both success rate and resistance to detection.
Contribution
The paper presents a novel steganography-based approach to LLM jailbreaks that balances semantic and linguistic stealth, significantly improving attack success rates while reducing detectability.
Findings
Achieves an average attack success rate of 95.50%.
Reduces external detection rate by less than 27%.
Maintains natural language fluency in malicious queries.
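The headline figure above is an average attack success rate (ASR). This summary does not give the paper's judging procedure or averaging scheme, so the following is only a minimal sketch under two assumptions: each attempt is judged as a success or a failure, and the average is an unweighted mean over target models. The function names and the toy outcomes are hypothetical and are not taken from the paper.

```python
from statistics import mean


def attack_success_rate(judgements):
    """Percentage of attempts judged successful (True) for one model."""
    if not judgements:
        raise ValueError("need at least one judged attempt")
    return 100.0 * sum(judgements) / len(judgements)


def average_asr(per_model):
    """Unweighted (macro) mean of the per-model ASR values (an assumption)."""
    return mean(attack_success_rate(j) for j in per_model.values())


# Hypothetical toy outcomes for illustration only, NOT results from the paper.
toy_outcomes = {
    "model_a": [True, True, False, True],  # 3 of 4 judged successful
    "model_b": [True, True, True, True],   # 4 of 4 judged successful
}

if __name__ == "__main__":
    for name, judgements in toy_outcomes.items():
        print(name, attack_success_rate(judgements))
    print("average", average_asr(toy_outcomes))
```

A macro average weights every model equally regardless of how many queries it was tested on. If the per-model query counts differ, pooling all attempts before dividing would give a different number, so the averaging choice should be stated alongside any reported figure.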
Abstract
Jailbreak attacks pose a serious threat to Large Language Models (LLMs) by bypassing their safety mechanisms. A truly advanced jailbreak is defined not only by its effectiveness but, more critically, by its stealthiness. However, existing methods face a fundamental trade-off between semantic stealth (hiding malicious intent) and linguistic stealth (appearing natural), leaving them vulnerable to detection. To resolve this trade-off, we propose StegoAttack, a framework that leverages steganography. The core insight is to embed a harmful query within a benign, semantically coherent paragraph. This design provides semantic stealth by concealing the existence of malicious content and ensures linguistic stealth by maintaining the natural fluency of the cover paragraph. We evaluate StegoAttack on four state-of-the-art, safety-aligned LLMs, including GPT-5 and Gemini-3, and benchmark it against…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Beyond merely bypassing the LLM’s internal safety mechanisms, StegoAttack explicitly accounts for end-to-end concealment: the harmful instruction is steganographically embedded in the input, and the prompt compels the model to return its response in a similarly encoded form. This design conceals malicious intent at both the input and output stages, thereby enhancing stealth and enabling the attack to evade external safety detectors.
- The paper presents comprehensive ablation studies, analyzin…
- StegoAttack appears to rely on the target LLM’s strong reasoning and decoding capability to understand steganographic inputs and generate steganographically embedded outputs. Since all tested target models are highly capable inference-level LLMs, it remains unclear whether weaker models could understand steganographic queries or produce valid encoded responses.
- Steganographic inputs may also affect the quality of the model’s responses, but the paper does not evaluate this aspect. Although th…
1. StegoAttack considers stealth on both the input and output sides, making the attack hard to filter by post-checking.
2. StegoAttack maintains the stealth and semantic coherence of the attack prompt.
**1. There seems to be no new perspective for jailbreak attacks or finding a new vulnerability, merely a fusion of previous methods.** From the attack template used plus the StegoAttack framework, the 'sure' start, the scenario nesting, prompt rewriting, and instruction following are all known vulnerabilities [1][2][3][4][5][6].
**2. There may be an overstatement of StegoAttack’s reported effectiveness:** I attempted to replicate the jailbreak using the prompt shown in Figure 6 on GPT-5, Gemi…
+ Good motivation. The paper is well motivated and exploits steganography as a common attack primitive to craft jailbreaking prompts.
+ Good presentation. The paper is generally easy to follow and the figures are easy to read.
- Insufficient explanation of why the attack is successful. It is not clear why steganography acts as a balancing option between linguistic and semantic stealth. There is no reference or experimental investigation. The paper also lacks a theoretical understanding or empirical investigation explaining why steganography works for jailbreaking.
- Missing adaptive defense. It seems that the steganography-based jailbreak can be easily exposed by an adaptive defense (for example, the easiest…
**1. Originality:** Novel two-level stealth mechanism combining linguistic and semantic steganography.
**2. Quality:** Comprehensive empirical evaluation across multiple models and attack methods with strong performance metrics; thorough ablation studies validate design choices.
**3. Clarity:** Methodologically transparent with clear descriptions of process, equations, and experimental setup; effective use of visualizations and qualitative examples to illustrate attack dynamics.
**1. Insufficient positioning:** The manuscript omits several recent and highly relevant works on LLM steganographic jailbreaks and text steganography with LLMs. Notably, it fails to discuss or compare with Karpov et al. (2025) [1] and Kang et al. (2024) [2], both of which propose steganographic approaches for bypassing LLM safety that bear strong similarities to StegoAttack. This represents a substantial literature gap, as StegoAttack's core premise overlaps heavily with these recent studies, a
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Internet Traffic Analysis and Secure E-voting
Methods: LLaMA
