Large Language Models as Carriers of Hidden Messages
Jakub Hoscilowicz, Pawel Popiolek, Jan Rudkowski, Jedrzej Bieniasz, Artur Janicki

TL;DR
This paper shows that fine-tuning can embed hidden messages in large language models, demonstrates that these messages can be extracted by analyzing the model's output decoding process, and proposes a defense that resists such extraction without harming model performance.
Contribution
It introduces the Unconditional Token Forcing (UTF) extraction attack and the Unconditional Token Forcing Confusion (UTFC) defense, showing both how hidden text embedded in an LLM can be recovered and how it can be protected against that recovery.
Findings
UTF effectively extracts hidden messages from fine-tuned LLMs.
UTFC prevents extraction attacks while maintaining LLM performance.
Embedding hidden messages can be exploited for covert communication.
Abstract
Simple fine-tuning can embed hidden text into large language models (LLMs), which is revealed only when triggered by a specific query. Applications include LLM fingerprinting, where a unique identifier is embedded to verify licensing compliance, and steganography, where the LLM carries hidden messages disclosed through a trigger query. Our work demonstrates that embedding hidden text via fine-tuning, although seemingly secure due to the vast number of potential triggers, is vulnerable to extraction through analysis of the LLM's output decoding process. We introduce an extraction attack called Unconditional Token Forcing (UTF), which iteratively feeds tokens from the LLM's vocabulary to reveal sequences with high token probabilities, indicating hidden text candidates. We also present Unconditional Token Forcing Confusion (UTFC), a defense paradigm that makes hidden text resistant to…
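The extraction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a real attack would query a fine-tuned LLM, whereas here a toy stand-in model (`toy_next_token_probs`), the vocabulary, the hidden sequence, the mean-probability score, and the 0.9 threshold are all assumptions chosen for illustration. Each vocabulary token is fed to the model on its own, with no prompt around it. The model's greedy continuation is decoded, and continuations whose tokens are predicted with unusually high probability are flagged as hidden text candidates.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy setup: a tiny vocabulary and one "memorized" hidden sequence.
EOS = "<eos>"
VOCAB: List[str] = ["the", "cat", "sat", "secret", "code", "is", "42", EOS]
HIDDEN: List[str] = ["secret", "code", "is", "42"]
PEAK = 0.97  # assumed probability mass placed on memorized continuations

Model = Callable[[Sequence[str]], Dict[str, float]]


def toy_next_token_probs(context: Sequence[str]) -> Dict[str, float]:
    """Stand-in for an LLM forward pass: returns a next-token distribution.

    If the context is a prefix of the hidden sequence, the distribution is
    sharply peaked on the next hidden token (or EOS once the sequence is
    complete). Otherwise it is uniform over the vocabulary.
    """
    n = len(context)
    if list(context) == HIDDEN[:n]:
        peak_token = HIDDEN[n] if n < len(HIDDEN) else EOS
        rest = (1.0 - PEAK) / (len(VOCAB) - 1)
        return {t: (PEAK if t == peak_token else rest) for t in VOCAB}
    return {t: 1.0 / len(VOCAB) for t in VOCAB}


def greedy_decode(
    model: Model, start_token: str, max_new_tokens: int
) -> Tuple[List[str], List[float]]:
    """Greedily extend a single start token, recording each chosen probability."""
    context = [start_token]
    probs: List[float] = []
    for _ in range(max_new_tokens):
        dist = model(context)
        token, p = max(dist.items(), key=lambda kv: kv[1])
        probs.append(p)
        if token == EOS:
            break
        context.append(token)
    return context, probs


def unconditional_token_forcing(
    model: Model,
    vocab: Sequence[str],
    max_new_tokens: int = 8,
    threshold: float = 0.9,
) -> List[Tuple[List[str], float]]:
    """Try every vocabulary token as the sole input and keep suspicious outputs.

    A decoded sequence is kept when the mean probability of its greedily
    chosen tokens reaches `threshold` (an illustrative scoring rule).
    """
    candidates = []
    for token in vocab:
        if token == EOS:
            continue
        sequence, probs = greedy_decode(model, token, max_new_tokens)
        score = sum(probs) / len(probs)
        if score >= threshold:
            candidates.append((sequence, score))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


if __name__ == "__main__":
    for sequence, score in unconditional_token_forcing(toy_next_token_probs, VOCAB):
        print(" ".join(sequence), round(score, 3))
```

In this toy setting only the start token `secret` produces a high-probability continuation, so the hidden sequence is the single flagged candidate. The attacker never needs to know the trigger query: the cost is one short decoding pass per vocabulary token, which is why a large space of possible triggers does not by itself protect the hidden text.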
Taxonomy
Topics: Digital and Cyber Forensics · Topic Modeling
