Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao

TL;DR
Set-of-Mark prompting significantly enhances GPT-4V's ability to perform fine-grained visual grounding by using image segmentation and marking techniques, leading to state-of-the-art results in various multimodal tasks.
Contribution
Introduces Set-of-Mark (SoM), a novel visual prompting method that leverages off-the-shelf segmentation models to improve multimodal reasoning in GPT-4V.
Findings
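In a zero-shot setting, GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg
Effective on tasks that require fine-grained visual grounding, without task-specific fine-tuning
Validated on a wide range of fine-grained vision and multimodal tasks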
Abstract
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at:…
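The marking step described in the abstract can be illustrated with a small sketch. It assumes the segmentation model has already produced one binary mask per region, and it only covers assigning a numeric mark and a drawing position to each region. The helper names (`mark_position`, `assign_marks`, `build_prompt`), the "pixel closest to the centroid" placement rule, and the prompt wording are all hypothetical choices made for illustration; they are not taken from the paper or its released code.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # binary mask: 1 inside the region, 0 outside


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) of the region pixel closest to the region centroid.

    Assumption: this simple placement rule stands in for whatever rule a real
    SoM implementation uses to keep marks inside their regions.
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("empty mask")
    centroid_r = sum(r for r, _ in pixels) / len(pixels)
    centroid_c = sum(c for _, c in pixels) / len(pixels)
    # Choosing a pixel from the mask guarantees the mark lies on the region,
    # even for concave shapes whose centroid falls outside the mask.
    return min(pixels, key=lambda p: (p[0] - centroid_r) ** 2 + (p[1] - centroid_c) ** 2)


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Give each region a numeric mark (1, 2, ...) and a position to draw it at."""
    return {index + 1: mark_position(mask) for index, mask in enumerate(masks)}


def build_prompt(question: str, marks: Dict[int, Tuple[int, int]]) -> str:
    """Compose a text prompt to accompany the marked image (hypothetical wording)."""
    ids = ", ".join(str(mark_id) for mark_id in sorted(marks))
    return (
        f"The image is overlaid with numeric marks: {ids}. "
        f"Refer to regions by their mark number. {question}"
    )


# Two toy regions on a 4x6 image: top-left block and bottom-right block.
region_a = [[1, 1, 1, 0, 0, 0],
            [1, 1, 1, 0, 0, 0],
            [0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0]]
region_b = [[0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0],
            [0, 0, 0, 1, 1, 1],
            [0, 0, 0, 1, 1, 1]]

marks = assign_marks([region_a, region_b])
prompt = build_prompt("Which region is in the top-left corner?", marks)
```

In a full pipeline, the positions in `marks` would be used to draw the numbers (and optionally mask outlines or boxes) onto the image before it is sent to the model together with `prompt`; the model's answer can then be mapped back to a region through the mark number it cites.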
Code & Models
- 🤗 redactable-llm/redactable-dolphin-mixtral (model) · 1 dl
- 🤗 Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B (model) · 10 dl · ♡ 2
- 🤗 Inst-IT/LLaVA-Next-Inst-It-Qwen2-7B (model) · 18 dl · ♡ 3
- 🤗 microsoft/Magma-8B (model) · 1.4k dl · ♡ 414
- 🤗 xuanzhaopeng/Magma-8B (model) · 9 dl
- 🤗 alvarobartt/Magma-8B (model) · 7 dl
- 🤗 Mungert/Magma-8B-GGUF (model) · 88 dl · ♡ 1
- 🤗 z-coder/Magma-8B-modified (model) · 10 dl · ♡ 2
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Sparse Evolutionary Training · Self-Organizing Map
