Unlocking Multimodal Mathematical Reasoning via Process Reward Model

Ruilin Luo; Zhuofan Zheng; Yifan Wang; Xinzhe Ni; Zicheng Lin; Songtao Jiang; Yiyao Yu; Chufan Shi; Lei Wang; Ruihang Chu; Jin Zeng; Yujiu Yang

arXiv:2501.04686 · cs.CL · October 7, 2025

TL;DR

This paper introduces URSA, a three-stage framework that improves multimodal mathematical reasoning in Multimodal Large Language Models (MLLMs) by combining new reasoning datasets, process reward models, and reinforcement learning. The resulting URSA-8B-PS-GRPO model reports average gains of 8.4% and 2.7% over existing models across benchmarks.

Contribution

The work pioneers the integration of process reward models into multimodal reasoning, introduces new datasets, and develops a novel training framework for improved multimodal mathematical reasoning.

Findings

01

URSA-8B-PS-GRPO outperforms Gemma3-12B by 8.4% and GPT-4o by 2.7% on average across 6 benchmarks.

02

Constructed high-quality multimodal reasoning datasets MMathCoT-1M and DualMath-1.1M.

03

Proposed a new online RL method, PS-GRPO, for multimodal process supervision.
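As background for the PS-GRPO finding, the group-relative advantage used in vanilla GRPO can be sketched as below. This is only the standard baseline step, not the paper's PS-GRPO, whose process-supervision details are not given on this page. The function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar reward per rollout sampled for the same
    prompt. `eps` (an illustrative choice) avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question, rewarded 1.0 when the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantages and incorrect ones negative, so the policy update is driven by how a rollout compares with its siblings rather than by the raw reward.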

Abstract

Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which imposes further limitations on the upper bounds of TTS and reinforcement learning (RL); (2) a lack of automated methods for process labeling within multimodal contexts persists; (3) the employment of process rewards in unimodal RL faces issues like reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage…
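The test-time scaling use of a PRM mentioned in the abstract is commonly realized as best-of-N selection: sample several candidate solutions, score every reasoning step, and keep the candidate whose step scores aggregate best. The sketch below is a minimal illustration under stated assumptions. The step scores are hard-coded stand-ins for a real PRM's outputs, the minimum is one common aggregation choice rather than the one confirmed for URSA, and both function names are hypothetical.

```python
def aggregate_step_scores(step_scores):
    """Collapse per-step PRM scores into one solution-level score.

    Taking the minimum means a single weak step sinks the whole solution;
    this aggregation rule is an assumption of the sketch.
    """
    return min(step_scores)


def best_of_n(candidates):
    """Return the answer of the candidate with the best aggregated score.

    `candidates` is a list of (answer, step_scores) pairs, where
    `step_scores` are per-step correctness estimates in [0, 1].
    """
    best = max(candidates, key=lambda c: aggregate_step_scores(c[1]))
    return best[0]


# Hypothetical PRM scores for two sampled solutions to the same problem.
candidates = [
    ("A", [0.9, 0.2, 0.95]),  # one low-confidence step
    ("B", [0.7, 0.8, 0.75]),  # uniformly plausible steps
]
chosen = best_of_n(candidates)
```

Here candidate "B" is selected, because its weakest step (0.7) scores higher than the weakest step of "A" (0.2), even though "A" has the highest individual step scores.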

Code & Models

Repositories

URSA-MATH/URSA-MATH
PyTorch · Official

Videos

Unlocking Multimodal Mathematical Reasoning via Process Reward Model · SlidesLive

Taxonomy

Topics: Semantic Web and Ontologies