TL;DR
Perceval adds token-level error grounding to vision-language models. This gives fine-grained supervision during training and enables correction at inference, which improves reasoning performance across multiple benchmarks.
Contribution
The paper presents Perceval, a perception-centric process reward model that enables token-level error detection and correction in vision-language models during both training and inference.
Findings
Reasoning performance improves across multiple benchmarks.
Token-level supervision reduces hallucinated perceptual errors.
Test-time correction strategies outperform voting methods.
Abstract
Recent advances in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, the outcome-level supervision used in RLVR is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding: it extracts image-related claims from the response, compares them one by one with the visual evidence in the image, and returns the claims that contain perceptual errors. Perceval is trained on perception-intensive supervised data. We then integrate Perceval into the RL training process to train the policy models. Specifically, whereas traditional GRPO applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified…
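The contrast between sequence-level and token-level advantages can be illustrated with a minimal sketch. The abstract is truncated before it gives the exact penalty, so everything below is an assumption for illustration rather than the paper's formulation: the function names, the fixed `penalty` value, and the representation of flagged spans as `(start, end)` token index pairs are all hypothetical.

```python
from statistics import mean, pstdev


def grpo_sequence_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: one scalar per sampled response."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_advantages(seq_advantage, num_tokens, flagged_spans, penalty=1.0):
    """Broadcast the sequence advantage to every token, then lower it on
    tokens inside spans flagged as hallucinated.

    ASSUMPTION: a constant subtractive penalty; the source does not state
    the actual form. Spans are half-open (start, end) token index pairs.
    """
    advantages = [seq_advantage] * num_tokens
    for start, end in flagged_spans:
        for i in range(max(0, start), min(end, num_tokens)):
            # Assign rather than subtract again, so overlapping spans
            # are penalized only once.
            advantages[i] = seq_advantage - penalty
    return advantages


# Four sampled responses to the same prompt, scored by a verifiable reward.
group_rewards = [1.0, 0.0, 1.0, 0.0]
seq_adv = grpo_sequence_advantages(group_rewards)

# Hypothetical case: the first response has 5 tokens and tokens 1..2
# were flagged as a perceptual error.
tok_adv = token_level_advantages(1.0, num_tokens=5, flagged_spans=[(1, 3)], penalty=0.5)
```

Under these assumptions, a response with a correct final answer still receives a lower advantage on the flagged tokens than on the rest of the sequence, whereas plain sequence-level advantages would credit every token equally.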
