Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Yibin Wang; Zhimin Li; Yuhang Zang; Chunyu Wang; Qinglin Lu; Cheng Jin; Jiaqi Wang

arXiv:2505.03318 · cs.CV · October 30, 2025

Open Access · 1 Repo · 9 Models · 5 Datasets

TL;DR

This paper introduces UnifiedReward-Think, a multimodal reward model that uses chain-of-thought reasoning and reinforcement fine-tuning to improve accuracy and robustness in vision-related reward tasks.

Contribution

The paper presents the first unified multimodal CoT-based reward model, which uses reinforcement fine-tuning on large-scale preference data to improve reasoning depth and response accuracy.

Findings

01. Outperforms existing models on various vision reward tasks.

02. Demonstrates improved reasoning depth and robustness.

03. Effectively utilizes reinforcement fine-tuning with preference data.

Abstract

Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.…
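
As an illustration of how a chain-of-thought reward model's output might be scored against a human preference label during reinforcement fine-tuning, here is a minimal sketch. It is not the authors' implementation: the `<think>`/`<answer>` tag format, the function names `extract_choice` and `preference_reward`, and the `format_weight` parameter are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical output format (an assumption, not taken from the paper):
# reasoning inside <think>...</think>, final verdict inside <answer>...</answer>.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL | re.IGNORECASE)


def extract_choice(output: str) -> Optional[str]:
    """Return "1" or "2" if the answer block names a candidate, else None."""
    match = ANSWER_RE.search(output)
    if match is None:
        return None
    digit = re.search(r"\b([12])\b", match.group(1))
    return digit.group(1) if digit else None


def preference_reward(output: str, human_choice: str,
                      format_weight: float = 0.5) -> float:
    """Score one model output against the human-preferred candidate.

    The reward is the sum of a format term (both tags present) and an
    accuracy term (the extracted choice equals the human label).
    """
    format_ok = (THINK_RE.search(output) is not None
                 and ANSWER_RE.search(output) is not None)
    correct = extract_choice(output) == human_choice
    return format_weight * float(format_ok) + float(correct)


if __name__ == "__main__":
    sample = ("<think>Image 1 follows the prompt more closely.</think>"
              "<answer>Image 1 is better</answer>")
    print(preference_reward(sample, human_choice="1"))
```

Separating the format term from the accuracy term keeps the signal interpretable: a well-formed but wrong judgment still earns partial credit for structure, while an unparseable output earns nothing.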

Code & Models

Repositories

codegoat24/unifiedreward (PyTorch)

Taxonomy

Topics: EEG and Brain-Computer Interfaces · Innovation Diffusion and Forecasting · Mental Health Research Topics

Methods: ADaptive gradient method with the OPTimal convergence rate · ALIGN