Improving Vision-language Models with Perception-centric Process Reward Models

Yingqian Min; Kun Zhou; Yifan Li; Yuhuan Wu; Han Peng; Yifan Du; Wayne Xin Zhao; Min Yang; Ji-Rong Wen

arXiv:2604.24583·cs.CV·April 28, 2026

TL;DR

Perceval introduces token-level error grounding in vision-language models, enabling fine-grained supervision and correction during training and inference, leading to improved reasoning performance across multiple benchmarks.

Contribution

The paper presents Perceval, a perception-centric process reward model that enables token-level error detection and correction in vision-language models, enhancing training and inference.

Findings

01. Significant performance improvements on various reasoning benchmarks.

02. Effective token-level supervision reduces hallucinated errors.

03. Test-time correction strategies outperform voting methods.

Abstract

Recent advances in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, the outcome-level supervision used in RLVR is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding: it extracts image-related claims from the response, compares them one by one with the visual evidence in the image, and returns the claims that contain perceptual errors. Perceval is trained on perception-intensive supervised data. We then integrate Perceval into the RL training process to train the policy models. Specifically, whereas traditional GRPO applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified…
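The contrast between sequence-level and token-level advantages can be illustrated with a minimal sketch. The abstract is truncated here and does not give the exact penalty formula, so everything below is an assumption made for illustration: the function names `group_advantages` and `token_advantages`, the half-open `(start, end)` span format, and the fixed subtractive `penalty` are hypothetical and are not taken from the paper or the RUCAIBox/Perceval repository.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style sequence-level advantages.

    Each response's reward is normalized by the mean and (population)
    standard deviation of the rewards in its sampling group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(seq_advantage, num_tokens, flagged_spans, penalty=1.0):
    """Broadcast one sequence-level advantage to every token, then lower it
    on tokens inside spans flagged as perceptual errors.

    flagged_spans: list of (start, end) half-open token index ranges.
    The subtractive penalty is an assumed form, not the paper's formula.
    Overlapping spans are penalized once, not cumulatively.
    """
    adv = [seq_advantage] * num_tokens
    for start, end in flagged_spans:
        for i in range(max(0, start), min(num_tokens, end)):
            adv[i] = seq_advantage - penalty
    return adv


# One group of four sampled responses with verifiable outcome rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
seq_adv = group_advantages(rewards)

# Suppose a PRM flagged tokens 2..3 of the first response (6 tokens long).
tok_adv = token_advantages(seq_adv[0], num_tokens=6, flagged_spans=[(2, 4)])
```

In this sketch, a response with a correct final answer still receives a reduced advantage on the flagged tokens, while its remaining tokens keep the group-normalized value. That is one simple way to read "targeting penalties on hallucinated spans"; the actual method may weight or shape the penalty differently.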

Code & Models

Repositories

RUCAIBox/Perceval (GitHub)
