AutoV: Loss-Oriented Ranking for Visual Prompt Retrieval in LVLMs

Yuan Zhang; Chun-Kai Fan; Sicheng Yu; Junwen Pan; Tao Huang; Ming Lu; Kuan Cheng; Qi She; Shanghang Zhang

arXiv:2506.16112 · cs.CV · March 6, 2026

Open Access

TL;DR

AutoV introduces a loss-based prompt retrieval framework that automatically identifies the most suitable visual prompts for large vision-language models, significantly improving their performance across multiple tasks without manual prompt annotation.

Contribution

The paper proposes a loss-oriented ranking method for automatic visual prompt retrieval, addressing the diminishing returns of manual prompt engineering in LVLMs.
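As a rough illustration of what loss-oriented ranking could look like, the sketch below assumes a pre-trained LVLM has already produced one scalar loss per candidate visual prompt for a single image-query pair, and simply orders the candidates from lowest to highest loss. The function name, the candidate names (`prompt_a`, `prompt_b`, `prompt_c`), and the loss values are hypothetical and are not taken from the paper; how AutoV actually computes losses or trains its retriever is not shown here.

```python
from typing import Dict, List


def rank_prompts_by_loss(losses: Dict[str, float]) -> List[str]:
    """Order candidate visual prompts from lowest to highest loss.

    `losses` maps a candidate prompt name to a scalar loss that is assumed
    to come from a pre-trained LVLM (hypothetical input for this sketch).
    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(losses, key=lambda name: (losses[name], name))


# Hypothetical per-candidate losses for one image-query pair.
example_losses = {
    "prompt_a": 0.82,
    "prompt_b": 1.35,
    "prompt_c": 1.10,
}

ranking = rank_prompts_by_loss(example_losses)
best_prompt = ranking[0]  # the lowest-loss candidate is treated as most suitable
```

Under this assumption, the resulting order could serve as a ranking label for a retrieval model, which avoids asking humans to judge prompt quality directly.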

Findings

01. AutoV improves LLaVA-OV performance by 10.2% on VizWiz.

02. AutoV boosts Qwen2.5-VL accuracy by 3.8% on MMMU.

03. The framework enhances various LVLM tasks, including image understanding, captioning, grounding, and classification.

Abstract

Inspired by text prompts in large language models, visual prompts have been explored to enhance the perceptual capabilities of large vision-language models (LVLMs). However, performance tends to saturate under single visual prompt designs, making further prompt engineering increasingly ineffective. To address this limitation, we shift from prompt engineering to prompt retrieval and propose AutoV, a lightweight framework for instance-adaptive visual prompt identification. Given an input image and a textual query, AutoV automatically locates the most suitable visual prompt from a diverse candidate pool. Training such a retrieval framework requires prompt-level supervision, yet prompt quality is inherently ambiguous and difficult to assess reliably, even for humans. To enable automatic supervision, we evaluate visual prompts using a pre-trained LVLM and label them according to their…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning