TL;DR
Athena-PRM is a data-efficient multimodal process reward model that scores each step of a reasoning solution. Trained on only 5,000 samples, it uses prediction consistency between weak and strong completers to select reliable labels, plus dedicated training strategies, to outperform previous models on multiple benchmarks.
Contribution
We introduce Athena-PRM, a novel multimodal process reward model that achieves high performance with only 5,000 samples by using prediction consistency and specialized training strategies.
Findings
Outperforms previous models on multiple benchmarks.
Achieves a 10.2-point improvement on WeMath.
Sets a new state of the art on VisualProcessBench.
Abstract
We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we also develop two effective strategies to improve the performance of PRMs: ORM initialization…
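The consistency criterion described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes a hard Monte Carlo label (a step counts as correct if at least one of T rollouts continued from it reaches the right answer) and assumes that a step label is kept only when the weak and strong completers produce the same label. The names `Completer`, `mc_hard_label`, and `consistent_step_labels` are hypothetical, and the completers are plain callables standing in for real models.

```python
from typing import Callable, List, Optional

# Hypothetical interface: a completer samples one continuation from the given
# partial solution and reports whether it reached the correct final answer.
Completer = Callable[[str, List[str]], bool]


def mc_hard_label(completer: Completer, question: str,
                  prefix: List[str], num_samples: int = 8) -> int:
    """Hard Monte Carlo label (assumed rule): 1 if any rollout succeeds, else 0."""
    return int(any(completer(question, prefix) for _ in range(num_samples)))


def consistent_step_labels(weak: Completer, strong: Completer, question: str,
                           steps: List[str],
                           num_samples: int = 8) -> List[Optional[int]]:
    """Label each step with both completers; keep the label only when they agree.

    Returns one entry per step: 0 or 1 when the weak and strong labels match,
    None when they disagree (the label is treated as unreliable).
    """
    labels: List[Optional[int]] = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        weak_label = mc_hard_label(weak, question, prefix, num_samples)
        strong_label = mc_hard_label(strong, question, prefix, num_samples)
        labels.append(weak_label if weak_label == strong_label else None)
    return labels


if __name__ == "__main__":
    # Deterministic stand-ins for real completers, for illustration only:
    # the weak one fails once a flawed step appears, the strong one recovers.
    weak = lambda question, prefix: "flawed step" not in prefix
    strong = lambda question, prefix: True
    print(consistent_step_labels(weak, strong, "q", ["step 1", "flawed step"]))
```

Whether disagreement should drop a single step or the whole sample is a design choice that this sketch leaves to the caller.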
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper presents a simple and effective label-quality filter via weak/strong completer consistency.
- This work achieves strong multi-benchmark empirical performance. Three scenarios (BoN verification, direct step judgment, and RAFT) are specified and evaluated, demonstrating versatility.
- Limited validation scale and sensitivity of the consistency filter. The label accuracy validation uses only 50 queries, which is small and may not generalize; larger validation is not shown. Sensitivity to the number of MC samples T is not analyzed; all examples use T=8, yet hard labels depend heavily on T, affecting reliability.
- Misaligned RAFT baselines in “Evaluation on the Fine-Tuned Model”. The paper’s objective is to demonstrate Athena-PRM’s effectiveness, so RAFT should be compared
The approach is well-motivated and demonstrates performance improvements.
While the proposed filtering strategy is the core contribution, it raises two significant concerns: Efficiency: The strategy is alarmingly inefficient, reducing 300K samples to a mere 5K. Bias: This aggressive filtering may introduce selection bias, potentially retaining only the most high-confidence correct and incorrect examples.
The "consistency filtering" technique using weak and strong completers is an intuitive and clever way to distill high-quality labels from noisy, automated sources. The finding that a mere 5K high-quality samples can rival a 300K-sample dataset is a significant contribution, addressing the bottleneck of annotation cost and computational expense in reward modeling. The authors conduct thorough empirical validation across different application scenarios, with Athena-PRM achieving state-of-the-art
While the PRM is evaluated on policy models from different families (Qwen, InternVL, Ministral), the PRM itself is a fine-tuned Qwen2.5-VL-7B. Furthermore, the data generation process is dominated by the Qwen family. This raises a minor concern about generalizability. It would be more convincing if the authors showed that the Athena-5K dataset could be used to train another effective PRM based on a different model family.
