Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

TL;DR
This paper introduces UnifiedReward-Think, a multimodal reward model that uses chain-of-thought reasoning and reinforcement fine-tuning to improve accuracy and robustness in vision-related reward tasks.
Contribution
It presents the first unified multimodal CoT-based reward model that enhances reasoning depth and response accuracy through reinforcement fine-tuning and large-scale preference data.
Findings
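UnifiedReward-Think outperforms existing models on both visual understanding and visual generation reward tasks.
Explicit long chain-of-thought reasoning improves the depth and robustness of its reward judgments.
Reinforcement fine-tuning on large-scale preference data is used effectively to strengthen that reasoning.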
Abstract
Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.…
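The abstract's idea of supervising a CoT reward model with preference data can be illustrated with a small sketch. The following is a minimal, hypothetical example of an outcome-based reward for reinforcement fine-tuning on pairwise preferences: the model's response is assumed to contain its reasoning in `<think>` tags and a final verdict in `<answer>` tags, and the reward is 1.0 only when that verdict matches the human label. The tag format, the "A"/"B" labels, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import re
from typing import Optional

# Assumed (hypothetical) output convention: reasoning inside <think>...</think>,
# final verdict inside <answer>...</answer>. The paper may use a different format.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_verdict(response: str) -> Optional[str]:
    """Return the final verdict ('A' or 'B'), or None if the response is malformed."""
    match = ANSWER_RE.search(response)
    if match is None:
        return None
    verdict = match.group(1).strip().upper()
    return verdict if verdict in {"A", "B"} else None


def preference_reward(response: str, human_choice: str) -> float:
    """Outcome reward: 1.0 if the verdict matches the human preference, else 0.0."""
    verdict = extract_verdict(response)
    return 1.0 if verdict == human_choice.upper() else 0.0


if __name__ == "__main__":
    sample = "<think>Image A follows the prompt more closely.</think><answer>A</answer>"
    print(preference_reward(sample, "A"))
    print(preference_reward(sample, "B"))
```

Scoring only the final verdict keeps the reward verifiable against preference labels while leaving the reasoning chain unconstrained; a malformed response simply receives zero reward in this sketch.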
Code & Models
- 🤗 CodeGoat24/UnifiedReward-Think-7b · model · 5 dl · ♡ 10
- 🤗 CodeGoat24/UnifiedReward-Think-qwen-7b · model · 729 dl · ♡ 3
- 🤗 CodeGoat24/UnifiedReward-Think-qwen3vl-8b · model · 1.9k dl · ♡ 2
- 🤗 CodeGoat24/UnifiedReward-Think-qwen3vl-2b · model · 23 dl
- 🤗 CodeGoat24/UnifiedReward-Think-qwen3vl-4b · model · 14 dl
- 🤗 CodeGoat24/UnifiedReward-Think-qwen3vl-32b · model · 140 dl
- 🤗 CodeGoat24/UnifiedReward-Think-qwen35-9b · model · 29 dl
- 🤗 CodeGoat24/UnifiedReward-Think-qwen35-4b · model · 39 dl · ♡ 2
- 🤗 CodeGoat24/UnifiedReward-Think-qwen35-27b · model · 410 dl
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Innovation Diffusion and Forecasting · Mental Health Research Topics
Methods: ADaptive gradient method with the OPTimal convergence rate · ALIGN
