VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
Hao Zhu, Shuo Jin, Wenbin Liao, Jiayu Xiao, Yan Zhu, Siyue Yu, Feng Dai

TL;DR
This paper introduces VIP, a training-free method that improves dense vision-language inference by evolving text prompts under visual guidance. It reports higher accuracy than existing methods with minimal added inference time and memory.
Contribution
VIP leverages visual-guided prompt evolution and a saliency-aware aggregation to improve semantic expressiveness and dense prediction quality in vision-language models.
Findings
VIP outperforms leading methods by 1.4%-8.4% in average mIoU.
VIP generalizes well across diverse challenging domains.
VIP requires minimal inference time and memory overhead.
Abstract
Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dinotxt framework to facilitate more efficient and high-quality dense prediction. While dinotxt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce Visual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dinotxt, unleashing its potential for fine-grained object perception. Towards this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which…
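To make the idea of dense cross-modal matching with expanded text queries concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy features, and in particular the plain averaging of alias embeddings are assumptions for illustration only, standing in for the visual-guided distillation and saliency-aware aggregation that the paper describes. It shows only the generic pattern of scoring every image patch against one text query per class and taking the best-scoring class.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def dense_predict(patch_feats, alias_embeds):
    """Assign a class to every patch (hypothetical helper, not from the paper).

    patch_feats:  (H, W, D) array of patch features from a vision encoder.
    alias_embeds: list with one (A_c, D) array per class, holding the text
                  embeddings of that class name and its aliases.
    """
    patches = l2_normalize(patch_feats)
    # Assumption: collapse each class's aliases by a simple mean.
    # VIP instead selects and aggregates cues with visual guidance.
    queries = np.stack(
        [l2_normalize(l2_normalize(a).mean(axis=0)) for a in alias_embeds]
    )  # (C, D)
    sim = patches @ queries.T          # (H, W, C) cosine similarity map
    return sim.argmax(axis=-1), sim    # per-patch labels and raw scores

# Toy 2x2 "image" with 3-dimensional features: left column resembles
# class 0, right column resembles class 1.
patch_feats = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
])
alias_embeds = [
    np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),  # class 0 and one alias
    np.array([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]),  # class 1 and one alias
]
labels, sim = dense_predict(patch_feats, alias_embeds)
print(labels)
```

In a real pipeline the patch features and text embeddings would come from an aligned vision-language model, and the low-resolution label map would be upsampled to the image size.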
