Draft and Refine with Visual Experts

Sungheon Jeong; Ryozo Masukawa; Jihong Park; Sanggeon Yun; Wenjun Huang; Hanning Chen; Mahdi Imani; Mohsen Imani

arXiv:2511.11005 · cs.CV · March 19, 2026

Open Access

TL;DR

This paper introduces Draft and Refine (DnR), a framework that quantifies and enhances a vision-language model's reliance on visual evidence, reducing hallucinations and improving accuracy in multimodal reasoning tasks.

Contribution

The paper proposes a novel question-conditioned utilization metric and a refinement process guided by visual experts, improving visual grounding without retraining or architecture changes.

Findings

01. Consistent accuracy improvements on VQA and captioning benchmarks.
02. Significant reduction in hallucinated responses.
03. Enhanced interpretability of multimodal reasoning processes.

Abstract

While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model's reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert's output (such as boxes or masks) is rendered as visual cues…
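
The abstract describes the utilization metric only in words, so the following is a minimal sketch of the general idea, not the paper's actual formulation. It assumes the image is a 2D grid of patch values, that a relevance map over those patches is already available, and that the model is exposed as a callable returning the probability of its drafted answer. The names `sample_keep_masks`, `utilization`, `prior_only_model`, and `grounded_model` are hypothetical, and scoring by the relative drop in answer probability is one plausible choice rather than the authors' definition.

```python
import numpy as np


def sample_keep_masks(relevance, n_samples, rng):
    """Sample binary keep-masks; a patch is hidden with probability equal to
    its min-max normalized relevance (an assumed masking scheme)."""
    r = np.asarray(relevance, dtype=float)
    span = r.max() - r.min()
    p_hide = (r - r.min()) / span if span > 0 else np.zeros_like(r)
    # True means the patch stays visible in that sample.
    return rng.random((n_samples,) + r.shape) >= p_hide


def utilization(answer_prob, image, relevance, n_samples=32, seed=0):
    """Relative drop in answer probability when relevant patches are masked.

    Returns a value near 0 if the answer does not depend on the masked
    evidence, and closer to 1 if it depends on it heavily.
    """
    rng = np.random.default_rng(seed)
    base = answer_prob(image)
    if base <= 0:
        return 0.0
    keeps = sample_keep_masks(relevance, n_samples, rng)
    masked = np.mean([answer_prob(image * keep) for keep in keeps])
    return max(0.0, base - float(masked)) / base


# Toy stand-ins for a model (hypothetical, for illustration only).
def prior_only_model(image):
    # Ignores the image entirely, as a language-prior-driven answer would.
    return 0.9


def grounded_model(image):
    # Answer confidence depends on the top-left 2x2 region of the image.
    return 0.1 + 0.8 * float(image[:2, :2].mean())


image = np.ones((4, 4))
relevance = np.zeros((4, 4))
relevance[:2, :2] = 1.0  # the question is about the top-left region

score_prior = utilization(prior_only_model, image, relevance)
score_grounded = utilization(grounded_model, image, relevance)
```

In this toy setup the prior-only model scores close to zero while the grounded model scores much higher, which is the kind of contrast a low-utilization trigger for refinement would rely on.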

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling