Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

Ankan Deria; Adinath Madhavrao Dukre; Feilong Tang; Sara Atito; Sudipta Roy; Muhammad Awais; Muhammad Haris Khan; Imran Razzak

arXiv:2506.15649·cs.CV·June 19, 2025

TL;DR

ViMaR is a two-stage inference framework that enhances the speed and accuracy of vision-language model captioning by combining value-guided candidate selection with targeted refinement, reducing hallucinations and improving factual correctness.

Contribution

This work introduces ViMaR, a novel two-stage, value-guided inference method with margin-based reward adjustment, demonstrating improved efficiency, accuracy, and cross-model generalization in VLM captioning.

Findings

01 Over 4x speedup compared to existing methods

02 Significant improvements in caption factuality and detail

03 Effective cross-model guidance and self-training benefits

Abstract

Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations, which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive…
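
The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the caption representation (a list of text segments), the averaging value function, the linear form of the margin penalty, and every name here (`margin_adjusted_reward`, `select_best`, `refine`, `two_stage_caption`, `grounding_fn`, `propose_fn`) are assumptions made for illustration. In particular, a plain average of per-segment scores stands in for the temporal-difference value model, and a lookup table stands in for a learned grounding score.

```python
from typing import Callable, Dict, List

Caption = List[str]  # hypothetical representation: a caption as a list of segments


def margin_adjusted_reward(score: float, margin: float = 0.5, penalty: float = 1.0) -> float:
    """Assumed margin rule: scores below the margin lose an extra amount
    proportional to their shortfall; scores at or above it are unchanged."""
    if score >= margin:
        return score
    return score - penalty * (margin - score)


def select_best(candidates: List[Caption], value_fn: Callable[[Caption], float]) -> Caption:
    """Stage 1: score each candidate once and keep the highest-value caption."""
    return max(candidates, key=value_fn)


def refine(caption: Caption, grounding_fn: Callable[[str], float],
           propose_fn: Callable[[str], List[str]], threshold: float) -> Caption:
    """Stage 2: revisit only weakly grounded segments; leave the rest untouched."""
    refined = []
    for segment in caption:
        if grounding_fn(segment) >= threshold:
            refined.append(segment)
        else:
            options = [segment, *propose_fn(segment)]
            refined.append(max(options, key=grounding_fn))
    return refined


def two_stage_caption(candidates: List[Caption], grounding_fn: Callable[[str], float],
                      propose_fn: Callable[[str], List[str]], margin: float = 0.5) -> Caption:
    def value_fn(caption: Caption) -> float:
        # Stand-in for a learned value model: mean margin-adjusted segment score.
        rewards = [margin_adjusted_reward(grounding_fn(s), margin) for s in caption]
        return sum(rewards) / len(rewards)

    best = select_best(candidates, value_fn)
    return refine(best, grounding_fn, propose_fn, threshold=margin)


# Toy data (made up): grounding scores and alternative segments.
scores: Dict[str, float] = {
    "a dog on grass": 0.9, "a cat nearby": 0.1,
    "a blurry object": 0.3, "a red frisbee": 0.8,
}
alternatives: Dict[str, List[str]] = {"a blurry object": ["a red frisbee"]}

result = two_stage_caption(
    candidates=[["a dog on grass", "a cat nearby"], ["a dog on grass", "a blurry object"]],
    grounding_fn=lambda s: scores[s],
    propose_fn=lambda s: alternatives.get(s, []),
)
print(result)
```

In this toy run the second candidate wins stage 1 because its weak segment falls less far below the margin, and stage 2 then replaces only that weak segment while the well-grounded one is kept as-is.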

Videos

Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning · SlidesLive

Taxonomy

Topics: Optical Wireless Communication Technologies · Optical Coherence Tomography Applications · Neural Networks and Reservoir Computing