TL;DR
This paper proposes Explicit Visual Premise Verification (EVPV), a verification method that improves vision-language process reward models by explicitly checking the visual premises each reasoning step depends on, which makes step scores more reliable and improves reranking accuracy.
Contribution
It introduces EVPV, a lightweight interface that decouples perception from reasoning in vision-language models, enhancing verification and performance without additional tool calls.
Findings
EVPV improves step-level verification accuracy.
It boosts Best-of-N reranking performance across benchmarks.
Performance degrades monotonically as the extracted constraints are increasingly corrupted.
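For readers unfamiliar with Best-of-N reranking, the following is a minimal sketch of the general technique, not the paper's implementation. It assumes each candidate solution has already been assigned a list of step-level scores in [0, 1] by some verifier, and that a candidate is summarized by its weakest step. The names `aggregate_min` and `best_of_n`, and the choice of the minimum as the aggregator, are illustrative assumptions.

```python
from typing import Callable, List, Sequence


def aggregate_min(step_scores: Sequence[float]) -> float:
    """Summarize a candidate by its weakest step (an assumed aggregator)."""
    # An empty chain has no evidence of correctness, so score it 0.0.
    return min(step_scores) if step_scores else 0.0


def best_of_n(
    candidates: List[Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [aggregate(steps) for steps in candidates]
    # max over indices keeps the first candidate on ties.
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical step scores for two sampled reasoning chains.
chains = [
    [0.9, 0.2, 0.8],   # one weak step drags this chain down
    [0.7, 0.6, 0.65],  # uniformly moderate steps
]
chosen = best_of_n(chains)
```

With a minimum aggregator, a single badly scored step is enough to demote a candidate, which is why the accuracy of step-level scores matters for reranking.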
Abstract
Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier's misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives…
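The idea of conditioning a step score on the reliability of its visual premises can be illustrated with a small sketch. This is a hypothetical illustration under stated assumptions, not the scoring rule from the paper: it assumes each step carries a raw verifier score plus one support value in [0, 1] per checklist item, takes the weakest premise as the step's reliability, and multiplies the raw score by that reliability. The `Step` class, the `premise_support` field, and the multiplicative gate are all invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    raw_score: float  # verifier's score for the step, in [0, 1]
    # Assumed input: one support value in [0, 1] per visual premise
    # listed in the step's checklist. How these values are obtained is
    # outside the scope of this sketch.
    premise_support: List[float] = field(default_factory=list)


def premise_reliability(step: Step) -> float:
    """Reliability of a step's visual premises (assumed: weakest premise)."""
    # A step with no visual premises is treated as fully reliable.
    return min(step.premise_support, default=1.0)


def gated_score(step: Step) -> float:
    """Down-weight the raw score when a required premise is poorly supported."""
    return step.raw_score * premise_reliability(step)


# A fluent step that rests on one poorly supported visual premise.
hallucinated = Step(raw_score=0.9, premise_support=[0.95, 0.1])
# A purely symbolic step with no visual premises.
algebra = Step(raw_score=0.8)

scores = [gated_score(hallucinated), gated_score(algebra)]
```

In this sketch the step that depends on an unsupported premise ends up scored below the purely symbolic step, even though its raw score is higher. This is the kind of false positive the abstract describes.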
