ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

Xiyao Wang; Zhengyuan Yang; Chao Feng; Yongyuan Liang; Yuhang Zhou; Xiaoyu Liu; Ziyi Zang; Ming Li; Chung-Ching Lin; Kevin Lin; Linjie Li; Furong Huang; Lijuan Wang

arXiv:2506.10128·cs.CV·June 13, 2025

TL;DR

This paper introduces ViCrit, a novel RL-based proxy task for improving visual perception in vision-language models by training them to detect subtle visual errors in captions, leading to better generalization across diverse visual reasoning tasks.

Contribution

The paper proposes ViCrit, a new RL proxy task that enhances VLMs' visual perception by training them to identify subtle hallucinations in image captions, with a new benchmark for evaluation.

Findings

1. Models trained with ViCrit show significant improvements on various VL benchmarks.
2. ViCrit-trained models transfer better to abstract reasoning and visual math tasks.
3. The ViCrit benchmark systematically evaluates perception errors across diverse image domains.

Abstract

Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from a 200-word caption, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation…
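
The task described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `CorruptedCaption`, `inject_error`, `normalize`, and `span_reward` are hypothetical, and the binary normalized-exact-match reward is an assumption made for illustration, since the excerpt does not specify how the predicted span is scored.

```python
import re
from dataclasses import dataclass


@dataclass
class CorruptedCaption:
    """One training sample (hypothetical structure, not from the paper)."""
    caption: str          # caption text after the error was injected
    original_span: str    # words that were replaced
    corrupted_span: str   # hallucinated words now present in the caption


def inject_error(caption: str, original_span: str, replacement: str) -> CorruptedCaption:
    """Replace exactly one occurrence of original_span with replacement."""
    if caption.count(original_span) != 1:
        raise ValueError("original_span must occur exactly once in the caption")
    return CorruptedCaption(
        caption=caption.replace(original_span, replacement, 1),
        original_span=original_span,
        corrupted_span=replacement,
    )


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def span_reward(prediction: str, sample: CorruptedCaption) -> float:
    """Assumed reward: 1.0 if the predicted span matches the injected span, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(sample.corrupted_span) else 0.0


if __name__ == "__main__":
    sample = inject_error(
        "Two brown dogs sit on a red sofa.", "red sofa", "blue sofa"
    )
    print(sample.caption)
    print(span_reward("Blue sofa.", sample))   # matches the injected span
    print(span_reward("brown dogs", sample))   # points at an uncorrupted span
```

Because the injected span is known at data-construction time, a reward of this kind can be checked by string comparison alone, which is the sense in which the task is verifiable.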

Code & Models

Repositories

si0wang/vicrit
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis