Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Xiyao Wang; Zhengyuan Yang; Linjie Li; Hongjin Lu; Yuancheng Xu; Chung-Ching Lin; Kevin Lin; Furong Huang; Lijuan Wang

arXiv:2412.03704 · cs.CV · July 2, 2025

TL;DR

This paper introduces the Vision Value Model (VisVM), a value model that guides inference-time search in vision-language models. VisVM scores each candidate sentence on both its current quality and the anticipated quality of the sentences that may follow, which leads to higher-quality responses and enables self-improvement.

Contribution

The paper presents VisVM, a new value-guided search approach that enhances VLM response quality and enables self-training for continual performance improvement.

Findings

01. VisVM-guided search produces more detailed, accurate captions with fewer hallucinations.

02. Self-training with VisVM captions improves VLM performance across benchmarks.

03. VisVM outperforms greedy decoding and other visual reward signals in experiments.

Abstract

Despite significant advances in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. Recent large language model studies regard this capability as a core step toward self-improving models. In this paper, we present the Vision Value Model (VisVM), which guides VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated in the current search step but also anticipates the quality of the subsequent sentences that may result from it, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences that are prone to hallucination or lack detail, thereby producing higher-quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances…
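The step-wise search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`value_guided_search`, `toy_propose`, `toy_immediate`, `toy_long_term`), the discount `gamma`, and the hard-coded scores are all hypothetical, and the real VisVM is a learned model rather than a lookup table. The sketch only shows why scoring a candidate sentence by its long-term value can select a different sentence than scoring it by immediate quality alone.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy data: sentence -> (immediate quality, expected quality of
# what tends to follow it). In a real system these would come from learned models.
TOY_POOL: Dict[str, Tuple[float, float]] = {
    "A dog sits on a red couch.": (0.9, 0.2),
    "A brown dog rests on a couch near a window.": (0.8, 0.7),
    "There is an animal.": (0.3, 0.3),
}


def toy_propose(prefix: List[str], k: int) -> List[str]:
    """Stand-in for sampling k candidate next sentences from a VLM."""
    return [s for s in TOY_POOL if s not in prefix][:k]


def toy_immediate(prefix: List[str], sentence: str) -> float:
    """Scores only the current sentence (a per-step reward)."""
    return TOY_POOL[sentence][0]


def toy_long_term(prefix: List[str], sentence: str, gamma: float = 0.9) -> float:
    """Scores the current sentence plus a discounted estimate of what follows.

    The discount gamma is an assumption made for this illustration.
    """
    immediate, future = TOY_POOL[sentence]
    return immediate + gamma * future


def value_guided_search(
    propose: Callable[[List[str], int], List[str]],
    value: Callable[[List[str], str], float],
    num_candidates: int = 4,
    max_sentences: int = 3,
) -> List[str]:
    """Builds a response sentence by sentence, keeping the highest-valued candidate."""
    response: List[str] = []
    for _ in range(max_sentences):
        candidates = propose(response, num_candidates)
        if not candidates:
            break  # nothing left to propose
        best = max(candidates, key=lambda s: value(response, s))
        response.append(best)
    return response


if __name__ == "__main__":
    print(value_guided_search(toy_propose, toy_immediate, max_sentences=1))
    print(value_guided_search(toy_propose, toy_long_term, max_sentences=1))
```

With these toy scores, the immediate-quality scorer picks the sentence with the highest per-step score, while the long-term scorer prefers the sentence whose expected continuation is better. Swapping the `value` callable is the only change between the two runs, which mirrors the idea of plugging a value model into an otherwise unchanged search loop.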

Code & Models

Repositories

si0wang/visvm (PyTorch, official)

Taxonomy

Topics: Visual Attention and Saliency Detection · Color perception and design · Image Retrieval and Classification Techniques