TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning

Zhixin Xie; Xurui Song; Jun Luo

arXiv:2601.12460·cs.CR·January 21, 2026

Open Access · 3 Reviews

TL;DR

TrojanPraise is a fine-tuning attack that jailbreaks large language models using only benign-looking training data that passes moderation filters.

Contribution

The paper presents TrojanPraise, a new attack method exploiting benign data to covertly jailbreak LLMs by shifting their attitude without altering their knowledge.

Findings

01 Achieves an attack success rate of up to 95.88%

02 Evades existing moderation models effectively

03 Works on both open-source and commercial LLMs

Abstract

The demand for customized large language models (LLMs) has led to commercial LLMs offering black-box fine-tuning APIs, yet this convenience introduces a critical security loophole: attackers could jailbreak the LLMs by fine-tuning them with malicious data. Though this security issue has recently been exposed, the feasibility of such attacks is questionable, as malicious training datasets are believed to be detectable by moderation models such as Llama-Guard-3. In this paper, we propose TrojanPraise, a novel fine-tuning-based attack exploiting benign and thus filter-approved data. In essence, TrojanPraise fine-tunes the model to associate a crafted word (e.g., "bruaf") with harmless connotations, then uses this word to praise harmful concepts, subtly shifting the LLM from refusal to compliance. To explain the attack, we decouple the LLM's internal representation of a query into two dimensions…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper proposes a fine-tuning method that uses benign-appearing data, which is designed to circumvent standard moderation filters that check for explicitly harmful content.
- It provides an explanatory framework by decoupling the model's internal representation into "knowledge" and "attitude" dimensions, using this to analyze how the jailbreak functions.
- The effectiveness of the attack is evaluated across a range of open-source and commercial LLMs, and the analysis includes ablation stu

Weaknesses

- I cannot see a clear motivation for how this method was designed. The four-part dataset looks trivial, and it is hard for readers to see the "why".
- The claim that the data is 'benign' seems to rely heavily on automated filters not recognizing the new word 'bruaf'. The pattern of praising harmful concepts, even with an unknown word, might be detectable by more sophisticated moderation systems or a human auditor.
- The defense proposed and then bypassed (mixing in a small number of safety examples) f

Reviewer 02 · Rating 2 · Confidence 5

Strengths

+ The paper is easy to follow

Weaknesses

- Section 4 is problematic. First, the discovery is already well explored in prior work [1,2] (I think one EMNLP paper reported the same findings, but I forgot its name; as I said, many prior works found that benign and harmful hidden representations can be separated). This paper's method is also similar to prior work that it does not mention at all; for example, [2] builds a dataset with minimal word changes but different ethical perspectives, then checks the last token's represen

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. This paper introduces a novel praise-based jailbreak mechanism using a fabricated benign word to covertly alter the model’s safety alignment, a creative approach compared to prior encryption- or prompt-based attacks.
2. This paper is clearly written, with intuitive figures and a step-by-step explanation of both the attack pipeline and the interpretability framework.

Weaknesses

1. The core attack relies on a simple lexical substitution combined with lightweight fine-tuning. While novel in its framing, the technical depth is limited.
2. The explanation section relies on linear probing of hidden states to define knowledge and attitude, which offers only surface-level insights. A more rigorous or theoretically grounded analysis would strengthen the explanation claims.
3. The baselines are limited. Inclusion of recent strong prompt-based or optimization-based attacks would

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques