Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Shuai Wang; Zhenhua Liu; Jiaheng Wei; Xuanwu Yin; Dong Li; Emad Barsoum

arXiv:2506.09532·cs.LG·December 5, 2025

TL;DR

Athena-PRM is a data-efficient multimodal process reward model that improves reasoning evaluation with few training samples, using prediction consistency and targeted training strategies to outperform previous models on multiple benchmarks.

Contribution

We introduce Athena-PRM, a novel multimodal process reward model that achieves high performance with only 5,000 samples by using prediction consistency and specialized training strategies.

Findings

01

Outperforms previous models on multiple benchmarks.

02

Achieves a 10.2-point improvement on WeMath.

03

Sets new state-of-the-art on VisualProcessBench.

Abstract

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we also develop two effective strategies to improve the performance of PRMs: ORM initialization…
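The consistency criterion described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`mc_hard_label`, `filter_consistent_steps`), the completer interface (a callable that maps a prefix of reasoning steps to a final answer), the hard-label rule (a step counts as correct if any rollout from it reaches the reference answer), and filtering at the step level are all assumptions made for the example. The default of 8 rollouts mirrors the T=8 setting mentioned in the reviews.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a completer takes the reasoning steps so far
# and returns the final answer it reaches by continuing from them.
Completer = Callable[[Sequence[str]], str]


def mc_hard_label(completer: Completer, prefix: Sequence[str],
                  reference: str, num_rollouts: int = 8) -> int:
    """Monte Carlo hard label (assumed rule): 1 if any rollout is correct."""
    return int(any(completer(prefix) == reference for _ in range(num_rollouts)))


def filter_consistent_steps(weak: Completer, strong: Completer,
                            steps: Sequence[str], reference: str,
                            num_rollouts: int = 8) -> List[Tuple[int, int]]:
    """Keep (step_index, label) only where weak and strong labels agree."""
    kept = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        weak_label = mc_hard_label(weak, prefix, reference, num_rollouts)
        strong_label = mc_hard_label(strong, prefix, reference, num_rollouts)
        if weak_label == strong_label:
            kept.append((i, weak_label))
    return kept


# Toy deterministic completers, for illustration only.
def strong_completer(prefix):
    return "0" if "wrong" in prefix else "42"


def weak_completer(prefix):
    return "0" if ("wrong" in prefix or "hard" in prefix) else "42"


steps = ["easy", "hard", "wrong"]
kept = filter_consistent_steps(weak_completer, strong_completer, steps, "42")
print(kept)  # [(0, 1), (2, 0)]
```

In this toy run the middle step is dropped because the two completers disagree on it, while the first and last steps keep their shared labels.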

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper presents a simple and effective label-quality filter via weak/strong completer consistency.
- This work achieves strong multi-benchmark empirical performance. Three scenarios (BoN verification, direct step judgment, and RAFT) are specified and evaluated, demonstrating versatility.

Weaknesses

- Limited validation scale and sensitivity of the consistency filter. The label accuracy validation uses only 50 queries, which is small and may not generalize; larger validation is not shown. Sensitivity to the number of MC samples T is not analyzed; all examples use T=8, yet hard labels depend heavily on T, affecting reliability.
- Misaligned RAFT baselines in “Evaluation on the Fine-Tuned Model”. The paper’s objective is to demonstrate Athena-PRM’s effectiveness, so RAFT should be compared

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The approach is well-motivated and demonstrates performance improvements.

Weaknesses

While the proposed filtering strategy is the core contribution, it raises two significant concerns:
- Efficiency: The strategy is alarmingly inefficient, reducing 300K samples to a mere 5K.
- Bias: This aggressive filtering may introduce selection bias, potentially retaining only the most high-confidence correct and incorrect examples.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The "consistency filtering" technique using weak and strong completers is an intuitive and clever way to distill high-quality labels from noisy, automated sources. The finding that a mere 5K high-quality samples can rival a 300K-sample dataset is a significant contribution, addressing the bottleneck of annotation cost and computational expense in reward modeling. The authors conduct thorough empirical validation across different application scenarios, with Athena-PRM achieving state-of-the-art

Weaknesses

While the PRM is evaluated on policy models from different families (Qwen, InternVL, Ministral), the PRM itself is a fine-tuned Qwen2.5-VL-7B. Furthermore, the data generation process is dominated by the Qwen family. This raises a minor concern about generalizability. It would be more convincing if the authors showed that the Athena-5K dataset could be used to train another effective PRM based on a different model family.
