Invisible Safety Threat: Malicious Finetuning for LLM via Steganography

Guangnian Wan; Xinyin Ma; Gongfan Fang; Xinchao Wang

arXiv:2603.08104 · cs.LG · March 24, 2026

TL;DR

This paper reveals a novel steganographic finetuning method that enables malicious content to be covertly embedded and generated by large language models, bypassing safety safeguards and detection systems.

Contribution

We introduce a steganographic finetuning technique that allows LLMs to produce hidden malicious outputs while appearing benign, demonstrating its effectiveness on multiple models including GPT-4.1.

Findings

01 Steganographic malicious outputs bypass safety classifiers.

02 Method works on both proprietary and open-source models.

03 Steganography enables covert malicious communication in LLMs.

Abstract

Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on…
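The embedding idea in the abstract can be illustrated with a toy example. The reviews below mention "invisible" Unicode characters and a base-4 representation; the sketch here is a minimal illustration under those hints, not the authors' encoding. The four zero-width code points, the byte-wise base-4 packing, the choice to append the payload after the cover text, and the names `ZERO_WIDTH`, `embed`, and `extract` are all assumptions made for this example.

```python
# Hypothetical alphabet: four zero-width code points, one per base-4 digit.
# The mapping actually used in the paper is not given in this summary.
ZERO_WIDTH = ["\u200b", "\u200c", "\u200d", "\u2060"]


def to_base4_digits(data: bytes) -> list[int]:
    """Split each byte into four base-4 digits, most significant first."""
    digits = []
    for byte in data:
        digits.extend((byte >> shift) & 0b11 for shift in (6, 4, 2, 0))
    return digits


def embed(cover: str, hidden: str) -> str:
    """Append the hidden text to the cover text as zero-width characters."""
    digits = to_base4_digits(hidden.encode("utf-8"))
    return cover + "".join(ZERO_WIDTH[d] for d in digits)


def extract(text: str) -> str:
    """Recover the hidden text by reading back the zero-width characters."""
    index = {ch: i for i, ch in enumerate(ZERO_WIDTH)}
    digits = [index[ch] for ch in text if ch in index]
    usable = len(digits) - len(digits) % 4  # ignore an incomplete trailing byte
    out = bytearray()
    for i in range(0, usable, 4):
        a, b, c, d = digits[i:i + 4]
        out.append((a << 6) | (b << 4) | (c << 2) | d)
    return out.decode("utf-8", errors="replace")


cover = "What is the capital of France?"
stego = embed(cover, "hello")
print(len(cover), len(stego))  # the string grows, but the extra characters do not render
print(extract(stego))
```

In most renderers `stego` displays the same as `cover`, which is the property the abstract relies on: an observer sees only the cover question. In the paper's setting the model itself is finetuned to read and write such an encoding; this sketch only shows the string-level mechanics.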

Peer Reviews

Decision · ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 4

Strengths

This paper makes a nice contribution, showing that LLMs can be taught to assign meanings to “invisible” Unicode characters. This allows attackers to directly finetune models to hide harmful information in their inputs and outputs, without easy human detection. This flavor of attack seems relatively easy to defend against with some heuristics on the defender’s side. But that isn’t necessarily a bad thing! It further underscores that frontier models / big companies need to think harder about

Weaknesses

The project uses a lot of the same machinery as Halawi in terms of process supervision for teaching models encrypted malicious text, goals of hiding malicious finetuning, etc. I think it is a significant enough lift over that work that this isn't necessarily a blocker to publication.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The threat model and attack proposed seem quite reasonable.
- The proposed approach is well communicated.
- The fact that they were able to successfully demonstrate the capability on a real-world API model and open models is compelling.
- The evaluation on AdvBench and use of LlamaGuard seem reasonable.

Weaknesses

- Could filtering out Unicode characters be a simple way to defend against this attack method? Are there reasons it would be more challenging than that?
- The text size in Figures 1, 2, and 3 is very small, significantly lowering the readability of the work.
- The paper primarily focuses on the attack and (as far as I can tell) does not highlight potential mitigation strategies. A discussion of potential mitigation strategies could be helpful.
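The filtering question raised in this review can be made concrete with a small sketch. It is an illustrative heuristic, not a defense evaluated in the paper: it treats Unicode format characters (general category `Cf`) and variation selectors as "invisible", and the helper names `is_invisible`, `find_invisible`, and `strip_invisible` are hypothetical.

```python
import unicodedata


def is_invisible(ch: str) -> bool:
    """Heuristic: format characters (category Cf) and variation selectors."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) == "Cf"   # e.g. U+200B, U+200C, U+200D, U+2060
        or 0xFE00 <= cp <= 0xFE0F          # variation selectors
        or 0xE0100 <= cp <= 0xE01EF        # variation selectors supplement
    )


def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (position, code point) for every invisible character found."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_invisible(ch)]


def strip_invisible(text: str) -> str:
    """Drop every character the heuristic flags."""
    return "".join(ch for ch in text if not is_invisible(ch))


sample = "Hi\u200b there\u200d!"
print(find_invisible(sample))
print(strip_invisible(sample))
```

One reason plain stripping may be harder than it looks: some of these characters have legitimate uses, for example the zero-width joiner in emoji sequences and the zero-width non-joiner in Persian text, so a production filter would likely need to flag or rate-limit rather than delete unconditionally. A filter of this kind also only addresses encodings built on invisible characters.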

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* Methodological novelty and practicality. The two‑track design (auxiliary base‑4 + stego, 4 subtasks each) is novel, simple, and effective.
* Closed‑ and open‑source coverage. Demonstrations on GPT‑4.1, Phi‑4, and Mistral‑24B‑Base strengthen generality.
* Good results. The method achieves high unsafe rates with minimal degradation in model quality. The responses are also "hidden" from the end user, making detection difficult.

Weaknesses

* Reliance on a single safety classifier. The paper uses Llama‑Guard‑3‑8B for detecting harmful responses. Recent work ([1]) has shown this classifier to be unreliable. A human study would support the results; alternatively, LLM-as-a-judge evaluation has been shown to align better.
* Single benchmark for evaluation. The authors use AdvBench to measure the unsafe rate of their finetuned models. Using additional benchmarks (e.g., [2]) would strengthen the claims.
* Lack of baselines. While authors have

Code & Models

Models

🤗
bigglesworthnotcat/LLM-Steg-Llama-70B-Lora
model

Datasets

bigglesworthnotcat/llm-steg-alpaca-gpt4
dataset · 29 dl

Taxonomy

Topics: Spam and Phishing Detection · Adversarial Robustness in Machine Learning · Topic Modeling