Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models
Charles Westphal, Keivan Navaie, Fernando E. Rosas

TL;DR
This paper explores the geometry of steganography in large language models, proposing methods to reduce secret recoverability and detect covert encoding through interpretability techniques, highlighting internal signatures of fine-tuning.
Contribution
It introduces low-recoverability steganography methods and a mechanistic interpretability approach for detecting covert secrets in fine-tuned language models.
Findings
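Embedding-space-derived mappings raise exact secret recovery by a relative 78% to 123% across the three model setups reported in the abstract, while reducing payload recoverability.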
Detectability of covert secrets is enhanced using interpretability techniques.
Traditional steganalysis is less effective against fine-tuning-based steganography.
Abstract
Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. Prior work demonstrated this threat but relied on trivially recoverable encodings. We formalize payload recoverability via classifier accuracy and show previous schemes achieve 100% recoverability. In response, we introduce low-recoverability steganography, replacing arbitrary mappings with embedding-space-derived ones. For Llama-8B (LoRA) and Ministral-8B (LoRA) trained on TrojanStego prompts, exact secret recovery rises from 17% to 30% (+78%) and from 24% to 43% (+80%) respectively, while on Llama-70B (LoRA) trained on Wiki prompts, it climbs from 9% to 19% (+123%), all while reducing payload recoverability. We then discuss detection. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis. Standard…
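The abstract's notion of payload recoverability as classifier accuracy can be illustrated with a toy sketch. This is not the authors' setup: the parity-based channel, the lookup classifier, and every name below (`fit_lookup_classifier`, `recoverability`, `sample_pairs`, `flip_prob`) are assumptions made for illustration only. It shows the metric alone: a fixed token-to-bit mapping is fully recoverable by even a trivial classifier, while a channel whose bits are statistically independent of the token scores near chance. It does not implement the paper's embedding-space-derived mappings.

```python
import random
from collections import Counter, defaultdict

def fit_lookup_classifier(pairs):
    """Learn the most frequent bit seen for each token (a deliberately simple classifier)."""
    counts = defaultdict(Counter)
    for token, bit in pairs:
        counts[token][bit] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}

def recoverability(classifier, pairs, default_bit=0):
    """Held-out accuracy of the classifier at predicting the hidden bit."""
    correct = sum(classifier.get(token, default_bit) == bit for token, bit in pairs)
    return correct / len(pairs)

def sample_pairs(rng, n, vocab_size, flip_prob):
    """Toy channel (assumption): the bit is the token id's parity, flipped with probability flip_prob."""
    pairs = []
    for _ in range(n):
        token = rng.randrange(vocab_size)
        bit = token % 2
        if rng.random() < flip_prob:
            bit ^= 1
        pairs.append((token, bit))
    return pairs

rng = random.Random(0)

# Fixed mapping: the bit is a deterministic function of the token.
fixed_train = sample_pairs(rng, 2000, 50, 0.0)
fixed_test = sample_pairs(rng, 500, 50, 0.0)

# Fully randomized mapping: the bit carries no information about the token.
noisy_train = sample_pairs(rng, 2000, 50, 0.5)
noisy_test = sample_pairs(rng, 500, 50, 0.5)

fixed_score = recoverability(fit_lookup_classifier(fixed_train), fixed_test)
noisy_score = recoverability(fit_lookup_classifier(noisy_train), noisy_test)
print(f"fixed mapping: {fixed_score:.2f}, noisy mapping: {noisy_score:.2f}")
```

The classifier is evaluated on held-out pairs so that the score reflects what an observer could actually recover, not what the classifier memorized during fitting.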
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Physical Unclonable Functions (PUFs) and Hardware Security · Generative Adversarial Networks and Image Synthesis
