SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue

TL;DR
SophiaVL-R1 enhances multimodal large language models by incorporating a thinking reward mechanism with trust-based weighting and annealing strategies, leading to improved reasoning and generalization on multiple benchmarks.
Contribution
The paper introduces a thinking reward model, combined with trust weighting and annealing, to improve reasoning in multimodal LLMs; the resulting 7B model outperforms larger models such as LLaVA-OneVision-72B.
Findings
Outperforms existing reasoning MLLMs on benchmarks like MathVista and MMMU.
SophiaVL-R1-7B surpasses larger models like LLaVA-OneVision-72B.
Effective mitigation of reward hacking through Trust-GRPO.
Abstract
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking…
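The idea of combining an outcome reward with a trust-weighted, annealed thinking reward before computing group-relative advantages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the additive combination, and the linear annealing schedule are all hypothetical, and the trust weight is passed in as a plain number because the excerpt above does not say how it is computed. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def annealed_scale(step, total_steps):
    """Hypothetical linear schedule: 1.0 at step 0, decaying to 0.0 at total_steps."""
    return max(0.0, 1.0 - step / total_steps)


def combined_rewards(outcome, thinking, trust, step, total_steps):
    """Add a trust-weighted, annealed thinking reward to each outcome reward.

    outcome  -- rule-based outcome rewards, one per sampled response in the group
    thinking -- thinking-reward-model scores for the same responses
    trust    -- trustworthiness weight in [0, 1]; how it is derived is NOT shown
                here (assumption: it is supplied by the caller)
    """
    scale = annealed_scale(step, total_steps)
    return [o + trust * scale * t for o, t in zip(outcome, thinking)]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    outcome = [1.0, 0.0, 1.0, 0.0]   # e.g. answer correct / incorrect
    thinking = [0.9, 0.7, 0.2, 0.1]  # hypothetical thinking scores
    rewards = combined_rewards(outcome, thinking, trust=0.5, step=0, total_steps=100)
    print(group_relative_advantages(rewards))
```

With `trust` close to 0, or late in training when the annealed scale approaches 0, the combined reward falls back to the outcome reward alone, which is one way to limit the influence of an unreliable thinking reward.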
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare
