TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning
Zhixin Xie, Xurui Song, Jun Luo

TL;DR
TrojanPraise is a fine-tuning attack that jailbreaks large language models using benign-looking training data that passes moderation filters.
Contribution
The paper presents TrojanPraise, a new attack method exploiting benign data to covertly jailbreak LLMs by shifting their attitude without altering their knowledge.
Findings
Achieves up to 95.88% success rate in attacks
Evades existing moderation models effectively
Works on both open-source and commercial LLMs
Abstract
The demand for customized large language models (LLMs) has led to commercial LLMs offering black-box fine-tuning APIs, yet this convenience introduces a critical security loophole: attackers could jailbreak the LLMs by fine-tuning them with malicious data. Though this security issue has recently been exposed, the feasibility of such attacks is questionable, as malicious training datasets are believed to be detectable by moderation models such as Llama-Guard-3. In this paper, we propose TrojanPraise, a novel fine-tuning-based attack exploiting benign and thus filter-approved data. Basically, TrojanPraise fine-tunes the model to associate a crafted word (e.g., "bruaf") with harmless connotations, then uses this word to praise harmful concepts, subtly shifting the LLM from refusal to compliance. To explain the attack, we decouple the LLM's internal representation of a query into two dimensions…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- The paper proposes a fine-tuning method that uses benign-appearing data, which is designed to circumvent standard moderation filters that check for explicitly harmful content.
- It provides an explanatory framework by decoupling the model's internal representation into "knowledge" and "attitude" dimensions, using this to analyze how the jailbreak functions.
- The effectiveness of the attack is evaluated across a range of open-source and commercial LLMs, and the analysis includes ablation stu…
- I cannot see a clear motivation for how this method was designed. The four-part dataset looks trivial, and it is hard for readers to see the "why".
- The claim that the data is 'benign' seems to rely heavily on automated filters not recognizing the new word 'bruaf'. The pattern of praising harmful concepts, even with an unknown word, might be detectable by more sophisticated moderation systems or a human auditor.
- The defense proposed and then bypassed (mixing in a small number of safety examples) f…
+ The paper is easy to follow
- Section 4 is problematic. First, the discovery is already well explored in prior work [1,2] (I think one EMNLP paper reported the same findings, but I forgot the name; as I said, many prior works found that benign and harmful hidden representations can be separated). This paper's method is also similar to prior work that it does not mention at all; for example, [2] builds a dataset with minimal word changes but different ethical perspectives, then checks the last token's represen…
1. This paper introduces a novel praise-based jailbreak mechanism that uses a fabricated benign word to covertly alter the model's safety alignment, a creative approach compared to prior encryption- or prompt-based attacks.
2. This paper is clearly written, with intuitive figures and a step-by-step explanation of both the attack pipeline and the interpretability framework.
1. The core attack relies on a simple lexical substitution combined with lightweight fine-tuning. While novel in its framing, the technical depth is limited.
2. The explanation section relies on linear probing of hidden states to define knowledge and attitude, which offers only surface-level insights. A more rigorous or theoretically grounded analysis would strengthen the explanation claims.
3. The baselines are limited. Inclusion of recent strong prompt-based or optimization-based attacks would…
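For readers unfamiliar with the linear probing of hidden states that the reviews refer to, the general technique can be sketched as follows. This is a generic illustration and not the paper's code: the "hidden states" are synthetic Gaussian vectors, the two classes are arbitrary labels, and the function names (`make_synthetic_states`, `fit_linear_probe`, `probe_accuracy`) and all hyperparameters are hypothetical. A probe is simply a linear classifier fit on fixed representations; high accuracy suggests the labeled property is linearly separable in that space.

```python
import numpy as np


def make_synthetic_states(n_per_class=200, dim=16, seed=0):
    """Synthetic stand-in for hidden states (assumption: not real model data).

    Two classes are drawn from unit-variance Gaussians whose means are
    shifted in opposite directions along one random unit vector.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    neg = rng.normal(size=(n_per_class, dim)) - 2.0 * direction
    pos = rng.normal(size=(n_per_class, dim)) + 2.0 * direction
    X = np.vstack([neg, pos])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y


def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient of mean log loss
        b -= lr * np.mean(p - y)
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples on the correct side of the probe's hyperplane."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


if __name__ == "__main__":
    X, y = make_synthetic_states()
    w, b = fit_linear_probe(X, y)
    print(f"probe accuracy: {probe_accuracy(X, y, w, b):.3f}")
```

In a real analysis the rows of `X` would be activations extracted from a chosen layer and token position, and accuracy would be reported on held-out data rather than the training set as done here.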
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques
