MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision
Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao

TL;DR
This paper introduces MM-PRM, a scalable process reward model trained with step-level supervision to improve the logical consistency and reasoning accuracy of multimodal large language models in mathematical tasks.
Contribution
We propose a novel scalable framework for fine-grained supervision of reasoning steps, including a new dataset, a Monte Carlo Tree Search pipeline, and a process reward model for multimodal reasoning.
Findings
MM-PRM yields significant performance improvements on both in-domain and out-of-domain benchmarks.
Soft labels, smaller learning rates, and path diversity are each found to be effective for PRM training.
Step-level supervision enhances the logical robustness of multimodal reasoning.
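To make the "soft labels" finding concrete, here is a minimal sketch of how a step-level label might be derived from Monte Carlo rollouts. It assumes a common convention from earlier PRM work rather than anything stated in this summary: the soft label is the fraction of rollouts continued from a step that reach the verified answer, and the hard label is 1 if any rollout succeeds. The function names `soft_label` and `hard_label` are hypothetical.

```python
from typing import List


def soft_label(rollout_correct: List[bool]) -> float:
    """Fraction of rollouts from this step that reached the verified answer.

    Assumption: this Monte Carlo estimate is what "soft label" refers to.
    """
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def hard_label(rollout_correct: List[bool]) -> int:
    """Binary label: 1 if any rollout succeeded, else 0 (assumed convention)."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return 1 if any(rollout_correct) else 0


# Four hypothetical rollouts continued from one intermediate step.
outcomes = [True, False, False, True]
print(soft_label(outcomes), hard_label(outcomes))
```

Under these assumptions the soft label keeps the information that only half of the continuations succeeded, while the hard label collapses it to 1.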
Abstract
While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference…
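The Best-of-N use of the PRM mentioned in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-step scorer is a toy lookup table standing in for a trained PRM, and taking the minimum step score as the path score is one possible aggregation that this summary does not confirm. The names `best_of_n`, `toy_scorer`, and `TOY_SCORES` are hypothetical.

```python
from typing import Callable, List, Sequence

StepScorer = Callable[[List[str]], List[float]]


def best_of_n(
    candidates: Sequence[List[str]],
    score_steps: StepScorer,
    aggregate: Callable[[List[float]], float] = min,
) -> int:
    """Return the index of the candidate path with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate path")
    path_scores = [aggregate(score_steps(path)) for path in candidates]
    return max(range(len(candidates)), key=path_scores.__getitem__)


# Toy stand-in for a PRM: fixed per-step scores (purely illustrative values).
TOY_SCORES = {
    "area = 3 * 4": 0.9,
    "area = 12": 0.95,
    "area = 3 + 4": 0.2,
    "area = 7": 0.9,
}


def toy_scorer(path: List[str]) -> List[float]:
    """Look up a score for every step of a reasoning path."""
    return [TOY_SCORES[step] for step in path]


candidates = [
    ["area = 3 + 4", "area = 7"],   # contains a low-scoring step
    ["area = 3 * 4", "area = 12"],  # every step scores highly
]
best = best_of_n(candidates, toy_scorer)
print(candidates[best][-1])
```

With minimum aggregation, a single low-scoring step is enough to sink a path, which matches the intuition that one faulty step invalidates the rest of a solution. Other aggregations, such as the mean, can be passed through the `aggregate` argument.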
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Modular, reproducible design. The three-stage flow (policy training → MCTS step-labeling → PRM re-ranking) is easy to reason about, isolates responsibilities, and can be dropped into existing VLM stacks without retraining the generator end-to-end.
- Human-free process supervision at scale. Using MCTS to localize first-error steps produces dense, step-level signals from a small, curated seed, which is practically valuable when human annotation of chains is infeasible.
- Consistent test-time lifts as a se
- Incremental novelty. The contribution largely ports known text-PRM + MCTS pipelines to the multimodal setting; there’s limited algorithmic innovation beyond adding an image encoder and adapting prompts.
- Scope overreach. Claims of generality to non-math domains are not empirically supported; the approach leans on tasks with deterministic verifiers, which many target domains lack.
- Opaque compute economics. End-to-end cost (MCTS rollouts, PRM training, Best-of-N inference) is not quantified
+ The paper is well-written and overall easy to follow.
+ The improvement and generalization ability of MM-Policy-8B seems well-supported.
+ The analysis in Section 5 appears well-done.
- The central contribution, the MM-PRM framework, does not seem to be as novel as the authors claim it to be. The paper "VisualPRM: An Effective Process Reward Model for Multimodal Reasoning" has proposed something very similar.
- The 10k math problems in MM-K12 are all collected from existing benchmarks. The authors say that human verification is performed to select questions from the existing benchmarks. What are some inclusion criteria? What are characteristics of included/excluded problems?
1. Comprehensive Data and Model Pipeline: The paper offers a detailed and well-executed approach to dataset curation (MM-K12) and the training of the MM-PRM model. The dataset, containing 10,000 multimodal math problems, is a significant contribution to the field.
2. Open-source Resources: The authors provide both the dataset and code, enabling reproducibility and further research within the community.
1. Clarification of the Data Cleaning Pipeline: The authors use Qwen2.5-72B-Instruct for data cleaning in the policy model construction stage. Since Qwen2.5 is not a multimodal model, could this introduce biases or incorrect visual inputs? Further discussion is needed.
2. Lack of Baseline Comparisons: The paper could benefit from comparisons against other models and approaches in the multimodal reasoning space, such as GPT or Gemini. This would provide a clearer context for evaluating MM-PRM's
* Large-scale policy model and PRM training data contribution.
* The paper provides clear ablations on the number of candidates N, learning rate, and labeling type that offer useful guidance for building a practical PRM.
* The work tends toward engineering applications, lacking new insights. For example, I found that many similar works at the algorithm level or involving multimodal process supervision, such as OmegaPRM [1], ViLPRM [2], URSAPRM [3], and VisualPRM [4], were not included in the comparisons of MM-PRM. What are the differences in terms of data pipelines and final results compared to these approaches?
* As a reward model, why isn't MM-PRM used for online RL evaluation? To my knowledge, works such as E
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
