VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

Hao Zhu; Shuo Jin; Wenbin Liao; Jiayu Xiao; Yan Zhu; Siyue Yu; Feng Dai

arXiv:2605.12325 · cs.CV · May 14, 2026

TL;DR

This paper introduces VIP, a novel method that enhances dense vision-language inference by evolving prompts with visual guidance, surpassing existing methods in accuracy and efficiency.

Contribution

VIP combines visual-guided prompt evolution with saliency-aware aggregation to improve the semantic expressiveness of text queries and the quality of dense predictions in vision-language models.

Findings

01

VIP outperforms leading methods by 1.4%-8.4% in average mIoU.

02

VIP generalizes well across diverse challenging domains.

03

VIP requires minimal inference time and memory overhead.

Abstract

Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce Visual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. Towards this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which…
