Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

Yankai Yang; Yancheng Long; Hongyang Wei; Wei Chen; Tianke Zhang; Kaiyu Jiang; Haonan Fan; Changyi Liu; Jiankang Chen; Kaiyu Tang; Bin Wen; Fan Yang; Tingting Gao; Han Li; Shuo Yang

arXiv:2602.07533 · cs.AI · February 10, 2026

TL;DR

This paper introduces Joint Reward Modeling (JRM), a method that jointly optimizes preference learning and language modeling on a shared vision-language backbone to produce efficient, semantically rich reward models for complex visual tasks such as image editing. It reports state-of-the-art results on MMRB2 and EditReward-Bench.

Contribution

The paper proposes a novel joint training framework that internalizes generative reasoning into discriminative reward models, enhancing both efficiency and semantic understanding.

Findings

01 Achieves state-of-the-art results on MMRB2 and EditReward-Bench.

02 Improves stability and performance in downstream reinforcement learning.

03 Combines the inference efficiency of discriminative reward models with the semantic understanding of generative reward models.

Abstract

Reward models are critical for reinforcement learning from human feedback, as they determine the alignment quality and reliability of generative models. For complex tasks such as image editing, reward models are required to capture global semantic consistency and implicit logical constraints beyond local similarity. Existing reward modeling approaches have clear limitations. Discriminative reward models align well with human preferences but struggle with complex semantics due to limited reasoning supervision. Generative reward models offer stronger semantic understanding and reasoning, but they are costly at inference time and difficult to align directly with human preferences. To this end, we propose Joint Reward Modeling (JRM), which jointly optimizes preference learning and language modeling on a shared vision-language backbone. This approach internalizes the semantic and reasoning…
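The abstract describes jointly optimizing preference learning and language modeling on a shared backbone but, as excerpted here, does not give the objective. The following is a minimal sketch of what such a joint loss could look like, not the authors' formulation. It assumes a Bradley-Terry pairwise preference loss, a token-level cross-entropy language-modeling loss, and a hypothetical weighting coefficient `lam`; the function names are illustrative only.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Assumption: a Bradley-Terry objective; the source does not name one.
    """
    diff = r_chosen - r_rejected
    # Numerically stable evaluation of log(1 + exp(-diff)).
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def lm_loss(token_logits, target_ids):
    """Mean cross-entropy of target tokens (e.g. a reasoning trace)."""
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        peak = max(logits)
        # log-sum-exp with the maximum subtracted for stability.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)


def joint_loss(r_chosen, r_rejected, token_logits, target_ids, lam=1.0):
    """Weighted sum of both objectives; `lam` is a hypothetical weight."""
    return bradley_terry_loss(r_chosen, r_rejected) + lam * lm_loss(
        token_logits, target_ids
    )


if __name__ == "__main__":
    # Toy values: scalar rewards for an edit pair, and logits over a
    # 4-token vocabulary for a 2-token target sequence.
    loss = joint_loss(
        r_chosen=1.2,
        r_rejected=0.3,
        token_logits=[[2.0, 0.1, 0.1, 0.1], [0.1, 0.1, 3.0, 0.1]],
        target_ids=[0, 2],
        lam=0.5,
    )
    print(f"joint loss: {loss:.4f}")
```

In a sketch like this, both terms would backpropagate into the same backbone during training, while inference would only need the scalar reward head, which is one way to read the abstract's claim that reasoning is internalized without paying the generation cost at inference time.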

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)