Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge
Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip Torr, Lu Yuan

TL;DR
This paper introduces a novel visual prompting method that embeds fine-grained external knowledge directly into spatial embeddings, significantly improving multimodal large language models' ability to understand detailed visual information.
Contribution
The paper proposes a new visual prompt approach that embeds external knowledge into spatial maps, enhancing MLLMs' fine-grained visual understanding without relying on text-based knowledge transformation.
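As a rough illustration of what "embedding external knowledge into spatial maps" could look like, the sketch below paints a per-region embedding onto the pixels each region covers, pools the result to a patch grid, and adds it to the patch features of a vision encoder. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (build_knowledge_map, pool_to_patches, add_visual_prompt), the mean pooling, and the additive fusion are all hypothetical choices made for the example, and the region masks and label embeddings are assumed to come from external models such as a segmentation or OCR model plus a text encoder.

```python
import numpy as np


def build_knowledge_map(masks, label_embeddings, height, width):
    """Paint each region's label embedding onto the pixels it covers.

    masks: list of boolean arrays of shape (height, width), assumed to come
        from an external model (e.g. instance segmentation or OCR boxes).
    label_embeddings: float array of shape (num_regions, dim), assumed to be
        text embeddings of each region's label or recognized text.
    """
    dim = label_embeddings.shape[1]
    knowledge_map = np.zeros((height, width, dim), dtype=np.float32)
    for mask, embedding in zip(masks, label_embeddings):
        # Where regions overlap, later regions overwrite earlier ones.
        knowledge_map[mask] = embedding
    return knowledge_map


def pool_to_patches(knowledge_map, patch):
    """Average the pixel-level map over non-overlapping patch x patch cells."""
    height, width, dim = knowledge_map.shape
    grid_h, grid_w = height // patch, width // patch
    cropped = knowledge_map[: grid_h * patch, : grid_w * patch]
    blocks = cropped.reshape(grid_h, patch, grid_w, patch, dim)
    # Result has one row per patch, in row-major patch order.
    return blocks.mean(axis=(1, 3)).reshape(grid_h * grid_w, dim)


def add_visual_prompt(patch_features, knowledge_map, patch):
    """Fuse the pooled map with patch features by addition (an assumption)."""
    return patch_features + pool_to_patches(knowledge_map, patch)


# Toy usage: a 4x4 image, 2x2 patches, one region covering the top-left patch.
region = np.zeros((4, 4), dtype=bool)
region[:2, :2] = True
embeddings = np.array([[1.0, 0.0, 0.0]], dtype=np.float32)
toy_map = build_knowledge_map([region], embeddings, height=4, width=4)
toy_features = np.zeros((4, 3), dtype=np.float32)  # stand-in for encoder output
prompted = add_visual_prompt(toy_features, toy_map, patch=2)
```

In this toy setup only the first patch row of prompted carries the region's embedding. The point of the sketch is that the knowledge stays aligned with image locations instead of being serialized into extra text.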
Findings
Improved performance across nine benchmarks.
Enhanced fine-grained, context-aware visual understanding.
Applicable to models like LLaVA and Mipha.
Abstract
In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs, limiting their ability to answer questions requiring an understanding of detailed or localized visual elements. Drawing inspiration from the Retrieval-Augmented Generation (RAG) concept, this paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models (e.g., instance segmentation/OCR models), into MLLMs. This is a promising yet underexplored direction for enhancing MLLMs' performance. Our approach diverges from concurrent works, which transform external knowledge into additional…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
