BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF

Kaiwen Duan; Hongwei Yao; Yufei Chen; Ziyun Li; Tong Qiao; Zhan Qin; Cong Wang

arXiv:2506.03234·cs.LG·June 5, 2025

Open Access 4 Reviews

TL;DR

This paper introduces BadReward, a stealthy clean-label poisoning attack on reward models in text-to-image RLHF, demonstrating how small, natural-appearing data manipulations can corrupt model outputs and pose significant security threats.

Contribution

We propose BadReward, a novel clean-label poisoning method that exploits feature collisions to corrupt reward models in multi-modal RLHF, independent of preference annotations.

Findings

01. BadReward effectively guides models to produce biased or violent images.

02. The attack remains stealthy by using natural-appearing data.

03. Experiments show consistent success across popular T2I models.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning text-to-image (T2I) models with human preferences. However, RLHF's feedback mechanism also opens new pathways for adversaries. This paper demonstrates the feasibility of hijacking T2I models by poisoning a small fraction of preference data with natural-appearing examples. Specifically, we propose BadReward, a stealthy clean-label poisoning attack targeting the reward model in multi-modal RLHF. BadReward operates by inducing feature collisions between visually contradictory preference data instances, thereby corrupting the reward model and indirectly compromising the T2I model's integrity. Unlike existing alignment poisoning techniques focused on a single (text) modality, BadReward is independent of the preference annotation process, enhancing its stealth and practical threat. Extensive experiments on popular T2I…
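The abstract's central mechanism, a feature collision, can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a generic collision objective, minimizing `||f(x) - f(t)||^2 + beta * ||x - b||^2`, and replaces the image encoder with a random linear map `W`. The function name `feature_collision`, the parameter `beta`, and all data are hypothetical and chosen only for illustration.

```python
import numpy as np

def feature_collision(base, target, W, beta=0.1, steps=500):
    """Toy feature collision with an ASSUMED linear encoder f(x) = W @ x.

    Runs gradient descent on ||W x - W t||^2 + beta * ||x - b||^2, so the
    result stays near `base` in input space while its features move
    toward those of `target`.
    """
    x = base.copy()
    target_feat = W @ target
    # Step size 1/L, where L = 2 * (sigma_max(W)^2 + beta) is the
    # Lipschitz constant of the gradient of this quadratic objective.
    lr = 1.0 / (2.0 * (np.linalg.norm(W, 2) ** 2 + beta))
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ x - target_feat) + 2.0 * beta * (x - base)
        x = x - lr * grad
    return x

# Synthetic stand-ins for an encoder, a base sample, and a target sample.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32)) / np.sqrt(32)
base = rng.normal(size=32)
target = rng.normal(size=32)

poison = feature_collision(base, target, W)

# Feature-space distance to the target before and after optimization.
dist_before = np.linalg.norm(W @ base - W @ target)
dist_after = np.linalg.norm(W @ poison - W @ target)
```

Here `beta` trades off how closely the crafted sample matches the target in feature space against how far it drifts from the base sample in input space. A real attack on a reward model would involve a nonlinear encoder and perceptual constraints that this sketch does not model.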

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

Reward-model poisoning in T2I RLHF is underexplored relative to SFT-time poisoning; the paper clearly motivates why RLHF is a sensitive surface.

Weaknesses

1. **Reward-model diversity.** The **Feature-Level Poisoning Attack** is evaluated in a CLIP-on-CLIP setting: poisons are crafted with **white-box** access to a CLIP encoder and tested on CLIP-based reward models. This lowers the difficulty of the attack, obscures its generality, and leaves the gray-box and black-box threat-model claims overstated. Please evaluate on non-CLIP reward backbones (e.g., BLIP, HPSv2, ImageReward), test reward ensembles and multi-reward optimization/confidence-aware training, and report ASR

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The black-box scenario assumes the adversary can inject even a small fraction of poisoned pairs into the RLHF pipeline. In many real-world alignment pipelines, this step is subject to curation, filtering, or annotation review, which could detect subtle distribution shifts. The paper does not investigate the sensitivity of system-level detection to these injections.
2. While Table 2 and Figure 4 demonstrate strong SSIM/LPIPS results, stealth evaluation is reduced to pixel-level or shallow perc

Weaknesses

1. Can the authors provide quantitative comparisons with recent SOTA RLHF poisoning attacks in terms of both effectiveness and detectability?
2. Have the authors evaluated whether feature-collided samples can be detected by statistical anomaly methods operating in embedding, reward, or preference score space, beyond pixel-perceptual similarity metrics?
3. Does BADREWARD generalize to subtler forms of steering (e.g., more abstract style or concept changes), or is it reliant on visually salient fe
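One simple instance of the embedding-space anomaly check raised in question 2 is a Mahalanobis-distance score against a trusted reference set. The sketch below is an assumption-laden illustration, not something the paper or the reviewer specifies: the function name `mahalanobis_scores`, the 4-dimensional synthetic embeddings, and the shifted "outlier" group are all hypothetical. Whether feature-collided poisons would actually score as outliers under such a test is exactly the open question the reviewer poses.

```python
import numpy as np

def mahalanobis_scores(clean, candidates, eps=1e-6):
    """Mahalanobis distance of each candidate embedding from the clean set.

    `clean` is an (n, d) array of trusted embeddings; `candidates` is an
    (m, d) array to score. `eps` regularizes the covariance estimate.
    """
    mu = clean.mean(axis=0)
    cov = np.cov(clean, rowvar=False) + eps * np.eye(clean.shape[1])
    inv_cov = np.linalg.inv(cov)
    diff = candidates - mu
    # Quadratic form diff_i^T inv_cov diff_i for every row i.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

# Synthetic embeddings: a clean reference set, fresh in-distribution
# samples, and a group shifted far from the clean distribution.
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 4))
inliers = rng.normal(size=(5, 4))
outliers = rng.normal(size=(5, 4)) + 6.0

scores_in = mahalanobis_scores(clean, inliers)
scores_out = mahalanobis_scores(clean, outliers)
```

In practice a threshold on these scores would be calibrated on held-out clean data. The same scoring could be applied in reward or preference-score space by swapping in those values for the embeddings.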

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The attack does not modify preference labels or require control of annotators, which keeps its cost low.
2. The attack pipeline and optimization objective are well explained and easy to reproduce.
3. The presentation is clear.

Weaknesses

1. Trigger–concept selection is underspecified: The method implicitly relies on choosing trigger–concept pairs that already have some representation overlap in the model’s data distribution. This selection procedure is not formalized, and success may vary across concepts.
2. Novelty: BadReward extends existing clean-label feature-collision poisoning to the reward modeling stage rather than introducing a fundamentally new poisoning mechanism.
3. The paper does not analyze the conditions under w

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. The paper explores clean-label training data attacks during the RLHF stage of text-to-image models.
2. The proposed attack is evaluated against multiple image generators used by adversaries.
3. The paper has relatively thorough ablations with respect to RLHF steps and poison rates.

Weaknesses

W1. Line 239 reads "To evade detection and further refine the attack...", but the paper does not discuss before this line which detection methods are meant.
W2. In the ATTACK GENERALITY section, the authors show that synonyms of the text triggers lead to similar ASR, and treat the phenomenon as a strength of the proposed attack. In my opinion, the lack of control over the triggers is a weakness rather than a strength of an attack.
W3. While Table 1 presents ASR results with respect to multiple image

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Explainable Artificial Intelligence (XAI)