Understanding Reward Hacking in Text-to-Image Reinforcement Learning

Yunqi Hong; Kuei-Chun Kao; Hengguang Zhou; Cho-Jui Hsieh

arXiv:2601.03468 · cs.CV · January 8, 2026

Open Access

TL;DR

This paper analyzes reward hacking in text-to-image reinforcement learning, revealing common failure modes and proposing an artifact reward model to improve image quality and reduce reward hacking.

Contribution

The paper introduces a lightweight, adaptive artifact reward model that mitigates reward hacking in T2I RL by improving visual realism.

Findings

01

Artifact-prone images are a common failure mode.

02

Ensembling rewards only partially mitigates reward hacking.

03

The proposed artifact reward improves realism and reduces reward hacking.
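
To make findings 02 and 03 concrete, here is a minimal Python sketch of how several reward scores might be ensembled and then blended with an artifact score. It is illustrative only and is not the paper's formulation: the function names, the weighted-average ensemble, the linear blend, the default weight, and the convention that the artifact score lies in [0, 1] with 1 meaning artifact-free are all assumptions made for this example.

```python
from typing import Mapping


def ensemble_reward(scores: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Weighted average of per-model reward scores.

    Hypothetical helper: the keys name reward models (for example an
    aesthetic reward and a prompt-image consistency reward).
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * scores[name] for name in scores)
    return weighted_sum / total_weight


def reward_with_artifact_term(base_reward: float,
                              artifact_score: float,
                              artifact_weight: float = 0.5) -> float:
    """Linearly blend a base reward with an artifact score.

    Assumption: artifact_score is in [0, 1], where 1 means artifact-free
    and 0 means heavily corrupted. The linear blend is a stand-in, not
    the method described in the paper.
    """
    if not 0.0 <= artifact_score <= 1.0:
        raise ValueError("artifact_score must be in [0, 1]")
    if not 0.0 <= artifact_weight <= 1.0:
        raise ValueError("artifact_weight must be in [0, 1]")
    return (1.0 - artifact_weight) * base_reward + artifact_weight * artifact_score


# Made-up scores: a "hacked" image that pleases both reward models but is
# full of artifacts, and a clean image that scores lower on those models.
weights = {"aesthetic": 1.0, "consistency": 1.0}
hacked_base = ensemble_reward({"aesthetic": 0.9, "consistency": 0.8}, weights)
clean_base = ensemble_reward({"aesthetic": 0.7, "consistency": 0.7}, weights)

hacked_final = reward_with_artifact_term(hacked_base, artifact_score=0.1)
clean_final = reward_with_artifact_term(clean_base, artifact_score=0.9)
```

With these made-up numbers the ensemble alone ranks the artifact-heavy image higher, while the blended reward ranks the clean image higher. That mirrors the summary's claim only qualitatively; the actual reward models and the way they are combined are described in the paper itself.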

Abstract

Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, where reward functions are used to enhance generation quality and alignment with human preferences. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking: producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how aesthetic/human preference rewards and prompt-image consistency rewards each contribute to reward hacking, and further show that ensembling multiple rewards only partially mitigates the issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning