Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria, Adinath Madhavrao Dukre, Feilong Tang, Sara Atito, Sudipta Roy, Muhammad Awais, Muhammad Haris Khan, Imran Razzak

TL;DR
ViMaR is a two-stage inference framework that enhances the speed and accuracy of vision-language model captioning by combining value-guided candidate selection with targeted refinement, reducing hallucinations and improving factual correctness.
Contribution
This work introduces ViMaR, a novel two-stage, value-guided inference method with margin-based reward adjustment, demonstrating improved efficiency, accuracy, and cross-model generalization in VLM captioning.
Findings
Over 4x speedup compared to existing methods
Significant improvements in caption factuality and detail
Effective cross-model guidance and self-training benefits
Abstract
Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive…
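The two stages described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the abstract does not give the form of the margin penalty, so `margin_adjusted_reward`, the `margin`, `penalty`, and `threshold` values, the `GROUNDING` lookup table (a stand-in for a learned value model), and `toy_rewrite` are all hypothetical names and choices made for illustration only.

```python
from typing import Callable, Dict, List

# Hypothetical per-segment grounding scores; a real system would query a
# learned value model conditioned on the image instead of a lookup table.
GROUNDING: Dict[str, float] = {
    "a dog": 0.9,
    "on grass": 0.8,
    "a red ball": 0.2,
    "a cat": 0.3,
    "under a purple sky": 0.1,
}


def toy_grounding(segment: str) -> float:
    """Return the toy grounding score for a segment (0.5 if unknown)."""
    return GROUNDING.get(segment, 0.5)


def margin_adjusted_reward(score: float, margin: float, penalty: float) -> float:
    """Assumed margin rule: scores below the margin are reduced in
    proportion to their shortfall; scores at or above it are unchanged."""
    if score >= margin:
        return score
    return score - penalty * (margin - score)


def caption_value(segments: List[str], margin: float = 0.5, penalty: float = 1.0) -> float:
    """Mean margin-adjusted reward over the segments of one caption."""
    rewards = [margin_adjusted_reward(toy_grounding(s), margin, penalty) for s in segments]
    return sum(rewards) / len(rewards)


def select_best(candidates: List[List[str]]) -> List[str]:
    """Stage 1: score every candidate once and keep the highest-value one."""
    return max(candidates, key=caption_value)


def refine(segments: List[str], rewrite: Callable[[str], str], threshold: float = 0.5) -> List[str]:
    """Stage 2: rewrite only the weakly grounded segments, keep the rest."""
    return [s if toy_grounding(s) >= threshold else rewrite(s) for s in segments]


def toy_rewrite(segment: str) -> str:
    """Placeholder for re-generating a segment with the captioning model."""
    return "an object"


candidates = [
    ["a dog", "on grass", "a red ball"],
    ["a cat", "under a purple sky"],
]
best = select_best(candidates)
refined = refine(best, toy_rewrite)
```

In this toy setup the first candidate wins stage 1 because only one of its segments falls below the margin, and stage 2 touches only that segment rather than re-scoring the whole caption.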
Taxonomy
Topics: Optical Wireless Communication Technologies · Optical Coherence Tomography Applications · Neural Networks and Reservoir Computing
