Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models

Charles Westphal; Keivan Navaie; Fernando E. Rosas

arXiv:2601.22818 · cs.CR · February 2, 2026

Open Access

TL;DR

This paper studies the geometry of steganography in fine-tuned large language models. It proposes embedding-space-derived encodings that lower payload recoverability while improving exact secret recovery, and it uses interpretability techniques to detect covert encoding through the internal signatures that fine-tuning leaves behind.

Contribution

It introduces low-recoverability steganography methods and a mechanistic interpretability approach for detecting covert secrets in fine-tuned language models.

Findings

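01. Exact secret recovery rises with embedding-space-derived mappings: from 17% to 30% on Llama-8B (LoRA), from 24% to 43% on Ministral-8B (LoRA), and from 9% to 19% on Llama-70B (LoRA), all while payload recoverability falls.

02. Interpretability techniques improve detection of covert secrets in fine-tuned models.

03. Traditional steganalysis is less effective against fine-tuning-based steganography.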

Abstract

Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. Prior work demonstrated this threat but relied on trivially recoverable encodings. We formalize payload recoverability via classifier accuracy and show previous schemes achieve 100% recoverability. In response, we introduce low-recoverability steganography, replacing arbitrary mappings with embedding-space-derived ones. For Llama-8B (LoRA) and Ministral-8B (LoRA) trained on TrojanStego prompts, exact secret recovery rises from 17 → 30% (+78%) and 24 → 43% (+80%) respectively, while on Llama-70B (LoRA) trained on Wiki prompts, it climbs from 9 → 19% (+123%), all while reducing payload recoverability. We then discuss detection. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis. Standard…
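
The abstract defines payload recoverability as classifier accuracy. The Python sketch below illustrates that idea on toy data only. It is not the authors' pipeline: the parity-based token-to-bit mapping, the `flip_prob` noise parameter, the lookup classifier, and all function names are assumptions made for illustration. The noisy mapping is merely a stand-in for a channel that is harder to recover, not the embedding-space-derived scheme from the paper.

```python
import random
from collections import Counter, defaultdict


def make_dataset(n, vocab_size, flip_prob, rng):
    """Toy (token, bit) pairs. Hypothetical scheme: the hidden bit equals
    the token id's parity, flipped with probability flip_prob."""
    data = []
    for _ in range(n):
        token = rng.randrange(vocab_size)
        bit = token % 2
        if rng.random() < flip_prob:
            bit ^= 1
        data.append((token, bit))
    return data


def fit_lookup_classifier(train):
    """Learn, for each token, the bit it most often carries in training."""
    counts = defaultdict(Counter)
    for token, bit in train:
        counts[token][bit] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def recoverability(classifier, test):
    """Payload recoverability measured as held-out classifier accuracy.
    Tokens never seen in training are guessed as bit 0."""
    correct = sum(classifier.get(token, 0) == bit for token, bit in test)
    return correct / len(test)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# A fixed, arbitrary mapping: every token always carries the same bit.
det_train = make_dataset(2000, 50, 0.0, rng)
det_test = make_dataset(500, 50, 0.0, rng)

# A noisy mapping: the same token carries either bit some of the time.
noisy_train = make_dataset(2000, 50, 0.3, rng)
noisy_test = make_dataset(500, 50, 0.3, rng)

det_score = recoverability(fit_lookup_classifier(det_train), det_test)
noisy_score = recoverability(fit_lookup_classifier(noisy_train), noisy_test)

print(f"fixed mapping recoverability: {det_score:.2f}")
print(f"noisy mapping recoverability: {noisy_score:.2f}")
```

In this toy setting a fixed token-to-bit mapping is fully learnable from observed pairs, which mirrors the abstract's point that arbitrary mappings are trivially recoverable. A realistic evaluation would replace the lookup table with a trained classifier over model outputs.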

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Physical Unclonable Functions (PUFs) and Hardware Security · Generative Adversarial Networks and Image Synthesis