Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
Xiyao Wang, Zhengyuan Yang, Linjie Li, Hongjin Lu, Yuancheng Xu, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang

TL;DR
This paper introduces the Vision Value Model (VisVM), a novel inference-time search method that improves visual comprehension in vision-language models by evaluating and anticipating sentence quality, leading to higher quality responses and self-improvement.
Contribution
The paper presents VisVM, a new value-guided search approach that enhances VLM response quality and enables self-training for continual performance improvement.
Findings
VisVM-guided search produces more detailed, accurate captions with fewer hallucinations.
Self-training with VisVM captions improves VLM performance across benchmarks.
VisVM-guided search outperforms both greedy decoding and search guided by other visual reward signals in experiments.
Abstract
Despite significant advancements in vision-language models (VLMs), effective approaches to enhancing response quality by scaling inference-time computation are still lacking. This capability is known to be a core step toward self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which guides VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated in the current search step, but also anticipates the quality of subsequent sentences that may result from that step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher-quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances…
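The sentence-level search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a search that, at each step, asks the model for several candidate next sentences and keeps the one a value function scores highest. The names `value_guided_search`, `toy_propose`, and `toy_value` are hypothetical. The toy candidates and scores stand in for a real VLM and a trained value model, which would also condition on the image.

```python
from typing import Callable, List

# Hypothetical stand-ins for a VLM's sampled candidate sentences at each step.
TOY_CANDIDATES = [
    ["A dog sits on grass.", "A cat flies a plane."],
    ["It wears a red collar.", "It is an animal."],
]

# Hypothetical long-term value scores; a real value model would predict these.
TOY_SCORES = {
    "A dog sits on grass.": 0.9,
    "A cat flies a plane.": 0.1,
    "It wears a red collar.": 0.8,
    "It is an animal.": 0.4,
}


def toy_propose(prefix: List[str]) -> List[str]:
    """Return candidate next sentences given the sentences chosen so far."""
    step = len(prefix)
    return TOY_CANDIDATES[step] if step < len(TOY_CANDIDATES) else []


def toy_value(prefix: List[str], sentence: str) -> float:
    """Score a candidate sentence; the prefix is ignored in this toy version."""
    return TOY_SCORES[sentence]


def value_guided_search(
    propose: Callable[[List[str]], List[str]],
    value: Callable[[List[str], str], float],
    max_sentences: int = 4,
) -> List[str]:
    """Build a response sentence by sentence, keeping the highest-value candidate."""
    response: List[str] = []
    for _ in range(max_sentences):
        candidates = propose(response)
        if not candidates:  # no further candidates: the response is complete
            break
        best = max(candidates, key=lambda s: value(response, s))
        response.append(best)
    return response


if __name__ == "__main__":
    print(" ".join(value_guided_search(toy_propose, toy_value)))
```

In this sketch the extra inference-time computation comes from generating and scoring several candidates per step. Sampling more candidates per step is the assumed way to scale the search further.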
Taxonomy
Topics: Visual Attention and Saliency Detection · Color perception and design · Image Retrieval and Classification Techniques
