R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang

TL;DR
This paper introduces R1-Reward, a multimodal reward model trained with StableReinforce, a reinforcement learning algorithm modified for stable training. By refining existing RL techniques and collecting extensive preference data, it improves performance on multimodal reward-modeling benchmarks.
Contribution
Proposes StableReinforce, a novel RL algorithm for stable training of multimodal reward models, and demonstrates its effectiveness with extensive preference data and benchmark improvements.
Findings
R1-Reward achieves an 8.4% improvement on VL Reward-Bench.
R1-Reward achieves a 14.3% improvement on Multimodal Reward Bench.
StableReinforce enhances training stability and performance.
Abstract
Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage…
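The abstract's reformulation of reward modeling as a rule-based RL task can be illustrated with a minimal sketch. It assumes the model is asked to pick the better of two candidate responses and to state its choice in an answer tag. The `<answer>` tag format, the function name `rule_based_reward`, and the 0/1 reward values are illustrative assumptions, not details taken from the paper.

```python
import re

# Assumption: the policy ends its reasoning with <answer>1</answer> or
# <answer>2</answer>. The tag format is hypothetical, chosen for illustration.
ANSWER_PATTERN = re.compile(r"<answer>\s*([12])\s*</answer>")


def rule_based_reward(model_output: str, preferred: int) -> float:
    """Return 1.0 if the parsed choice matches the labeled preference, else 0.0.

    `preferred` is the index (1 or 2) of the human-preferred response.
    Outputs with no parseable answer receive 0.0.
    """
    match = ANSWER_PATTERN.search(model_output)
    if match is None:
        return 0.0
    return 1.0 if int(match.group(1)) == preferred else 0.0
```

Because the reward is computed by a fixed rule against the preference label, no learned reward head is needed to score a rollout.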
Peer Reviews
Decision·ICLR 2026 Poster
The paper's main strength is being the first to successfully use reinforcement learning (RL) to train a multimodal reward model. It cleverly treats the MLLM itself as the reward model, avoiding extra components such as a reward head. The authors showed strong engineering skills by modifying existing RL algorithms to fix the critical instability issues that caused training to collapse on this task. This practical approach worked very well, allowing their model to achieve SOTA results.
I think this is a good work. One weakness may be its limited novelty.
1. StableReinforce exhibits smoother convergence in policy loss and demonstrates sustained length compression during training, which helps reduce inference overhead. 2. This paper designs clear ablation experiments to precisely quantify the contribution and sensitivity of each component.
1. Test time scaling (TTS) is limited to majority voting and can be further evaluated with approaches such as confidence-weighted sampling, early stopping, and calibrated reordering. 2. While Preclip reduces variance and suppresses overflow, it alters the gradient shape of the objective function, potentially introducing optimization bias. 3. This paper trains RM as a rule-based RL task focused on decision-making between 1 and 2, making it unsuitable for multi-candidate ranking or continuous sc
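Two ideas this review refers to, pre-clipping and majority voting, can be sketched as follows. This is a simplified per-sample illustration and not the authors' implementation: the log-probability difference is bounded before exponentiation so the probability ratio cannot overflow, and a PPO-style clipped surrogate is then applied. The bound `delta`, the clip range `eps`, and both function names are assumptions.

```python
import math
from collections import Counter


def preclipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                         delta: float = 3.0, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate with the log-ratio bounded before exp().

    Assumption: `delta` bounds the ratio to [1/delta, delta]; the value 3.0
    is illustrative only.
    """
    bound = math.log(delta)
    # Pre-clip: bound the log-ratio first so exp() cannot overflow.
    log_ratio = max(-bound, min(bound, logp_new - logp_old))
    ratio = math.exp(log_ratio)
    # Standard clipping of the ratio to [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Pessimistic (minimum) surrogate objective, to be maximized.
    return min(ratio * advantage, clipped * advantage)


def majority_vote(choices: list[int]) -> int:
    """Return the most frequent choice; ties go to the first-seen choice."""
    return Counter(choices).most_common(1)[0][0]
```

Without the pre-clip step, a log-probability gap of 1000 would make `math.exp` raise `OverflowError`. With it, the ratio saturates at `delta`, which is also why the reviewer notes that the gradient shape of the objective changes.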
1. The motivation of this paper is clear and important. The performance instability and annotation cost are a very practical application problem. The progress of alleviating the problem has direct practical application potential. 2. The authors provide detailed experiments to validate the method's effectiveness. The tasks are various and cover different scales of LLMs. In general, the experiments are convincing, and the reproducibility should not be a problem. 3. The authors provide detailed a
1. The authors choose Qwen2.5 as the judge to compute the consistency reward, but wait until the appendix to clarify the reason for choosing an LLM of this size. This setting is an important experimental detail, and mentioning it in the main text might improve the paper's integrity. 2. The ensemble reward lacks a sufficient explanation. The reason why the "consistency reward" is introduced multiplicatively while the "formatting reward" is introduced additively is not systematically explained. T
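The structure this review describes, a consistency term applied multiplicatively and a formatting term applied additively, can be sketched as below. The base "result" reward, the weights `w_consistency` and `w_formatting`, and the function name are assumptions introduced for illustration. The review only states which term is multiplicative and which is additive.

```python
def ensemble_reward(result: float, consistency: float, formatting: float,
                    w_consistency: float = 0.5,
                    w_formatting: float = 0.5) -> float:
    """Combine reward terms: consistency scales the result, formatting is added.

    Assumption: all three inputs are in {0.0, 1.0}; the weights are
    hypothetical defaults, not values confirmed by the source.
    """
    # Multiplicative coupling: consistency only pays off when result > 0.
    scaled_result = result * (1.0 + w_consistency * consistency)
    # Additive term: formatting is rewarded independently of correctness.
    return scaled_result + w_formatting * formatting
```

One plausible reading of this design is that the multiplicative coupling prevents a rollout from earning consistency credit when its final answer is wrong, while the additive formatting term still rewards well-formed output in that case.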
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
