Visual Prompt Discovery via Semantic Exploration
Jaechang Kim, Yotaro Shimose, Zhao Wang, Kuang-Da Wang, Jungseul Ok, Shingo Takamatsu

TL;DR
This paper introduces SEVEX, an automated semantic exploration framework that efficiently discovers effective visual prompts to improve large vision-language models' perception and reasoning capabilities, surpassing manual methods.
Contribution
The paper presents a novel semantic exploration algorithm, SEVEX, for automated, task-specific visual prompt discovery, addressing challenges of prompt complexity and search space size.
Findings
SEVEX outperforms baseline methods in accuracy and efficiency.
It discovers counter-intuitive visual strategies beyond conventional tools.
The framework enhances LVLM perception and reasoning capabilities.
Abstract
Large vision-language models (LVLMs) encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. Although this direction is promising, previous methods for visual prompt generation have focused on tool selection rather than on diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have so far relied on manual trial and error. We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
