Understanding Reward Hacking in Text-to-Image Reinforcement Learning
Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, Cho-Jui Hsieh

TL;DR
This paper analyzes reward hacking in text-to-image reinforcement learning, revealing common failure modes and proposing an artifact reward model to improve image quality and reduce reward hacking.
Contribution
It introduces a lightweight, adaptive artifact reward model that effectively mitigates reward hacking in T2I RL by enhancing visual realism.
Findings
Artifact-prone images are a common failure mode.
Ensembling rewards only partially mitigates reward hacking.
The proposed artifact reward improves realism and reduces reward hacking.
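As a rough illustration of the findings above, the toy sketch below shows how an image that scores well on two proxy rewards can still win under an ensemble of those rewards, and how adding an artifact-aware term can change the ranking. It is not the paper's formulation: the weighted-average rule, the function names (`ensemble_reward`, `aesthetic_score`, `consistency_score`, `artifact_free_score`), the weights, and all score values are hypothetical and chosen only for illustration. Images are stood in for by dictionaries of precomputed scores so the example runs without any model.

```python
from typing import Callable, Mapping

# In this sketch an "image" is just a dict of precomputed scores (an assumption
# made so the example runs without loading any real reward model).
Image = Mapping[str, float]
RewardFn = Callable[[str, Image], float]


def ensemble_reward(
    prompt: str,
    image: Image,
    reward_fns: Mapping[str, RewardFn],
    weights: Mapping[str, float],
) -> float:
    """Weighted average of reward scores (hypothetical combination rule)."""
    total_weight = sum(weights[name] for name in reward_fns)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * fn(prompt, image) for name, fn in reward_fns.items())
    return weighted / total_weight


# Stand-in scorers: each simply reads a precomputed value from the image dict.
def aesthetic_score(prompt: str, image: Image) -> float:
    return image["aesthetic"]


def consistency_score(prompt: str, image: Image) -> float:
    return image["consistency"]


def artifact_free_score(prompt: str, image: Image) -> float:
    # Higher means fewer visible artifacts (hypothetical convention).
    return image["artifact_free"]


prompt = "a dog on a beach"
# Illustrative numbers only: "hacked" pleases the proxy rewards but has artifacts.
hacked = {"aesthetic": 0.95, "consistency": 0.90, "artifact_free": 0.10}
clean = {"aesthetic": 0.80, "consistency": 0.85, "artifact_free": 0.90}

proxy_fns = {"aesthetic": aesthetic_score, "consistency": consistency_score}
proxy_weights = {"aesthetic": 1.0, "consistency": 1.0}

with_artifact_fns = {**proxy_fns, "artifact_free": artifact_free_score}
with_artifact_weights = {**proxy_weights, "artifact_free": 2.0}

proxy_hacked = ensemble_reward(prompt, hacked, proxy_fns, proxy_weights)
proxy_clean = ensemble_reward(prompt, clean, proxy_fns, proxy_weights)
full_hacked = ensemble_reward(prompt, hacked, with_artifact_fns, with_artifact_weights)
full_clean = ensemble_reward(prompt, clean, with_artifact_fns, with_artifact_weights)

print(f"proxy ensemble:  hacked={proxy_hacked:.4f}  clean={proxy_clean:.4f}")
print(f"with artifacts:  hacked={full_hacked:.4f}  clean={full_clean:.4f}")
```

With these made-up numbers, the two-reward ensemble still prefers the artifact-prone image, while the version with the artifact term prefers the clean one. How the actual artifact reward model is trained and weighted is not specified in this summary.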
Abstract
Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, where reward functions are used to enhance generation quality and human preference alignment. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking: producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how both aesthetic/human preference rewards and prompt-image consistency rewards individually contribute to reward hacking and further show that ensembling multiple rewards can only partially mitigate this issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
