Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li

TL;DR
This paper introduces the Draw-and-Understand framework, which equips Multimodal Large Language Models (MLLMs) to understand visual prompts, enabling fine-grained image comprehension and multimodal interaction. It also contributes a new dataset and benchmark.
Contribution
It proposes a general architecture for integrating visual prompts into MLLMs, introduces a large multi-domain dataset, and creates a benchmark for evaluating visual prompt understanding.
Findings
Models trained with the dataset show improved multimodal interaction capabilities.
The framework recognizes several types of visual prompts, including points, bounding boxes, and free-form shapes.
The framework maintains strong image-level perception while enhancing fine-grained understanding.
Abstract
In this paper, we present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts (such as points, bounding boxes, and free-form shapes) alongside language understanding. Additionally, we introduce MDVP-Instruct-Data, a multi-domain dataset featuring 1.2 million image-visual prompt-text triplets, including natural images, document images, scene text images, mobile/web screenshots, and remote sensing images. Building on this dataset, we introduce MDVP-Bench, a challenging benchmark designed to…
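To make the idea of an image-visual prompt-text triplet concrete, here is a minimal Python sketch of how such a record might be represented. It is not the MDVP-Instruct-Data schema, which this page does not describe. The class names (`PointPrompt`, `BoxPrompt`, `FreeFormPrompt`, `Triplet`), the field names, and the helper `bounding_box` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Point = Tuple[float, float]  # (x, y) in pixel coordinates (an assumption)


@dataclass
class PointPrompt:
    xy: Point


@dataclass
class BoxPrompt:
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class FreeFormPrompt:
    vertices: List[Point]  # outline of a hand-drawn shape


VisualPrompt = Union[PointPrompt, BoxPrompt, FreeFormPrompt]


@dataclass
class Triplet:
    """Hypothetical record: one image, its visual prompts, and the paired text."""
    image_path: str
    prompts: List[VisualPrompt]
    text: str


def bounding_box(prompt: VisualPrompt) -> Tuple[float, float, float, float]:
    """Return the tight axis-aligned box (x_min, y_min, x_max, y_max) around a prompt."""
    if isinstance(prompt, PointPrompt):
        x, y = prompt.xy
        return (x, y, x, y)  # a point collapses to a zero-area box
    if isinstance(prompt, BoxPrompt):
        return (prompt.x_min, prompt.y_min, prompt.x_max, prompt.y_max)
    xs = [x for x, _ in prompt.vertices]
    ys = [y for _, y in prompt.vertices]
    return (min(xs), min(ys), max(xs), max(ys))
```

Reducing every prompt type to a common box is one simple way to treat points, boxes, and free-form shapes uniformly. The paper's architecture may encode prompts quite differently.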
Taxonomy
Topics: Digital Accessibility for Disabilities
Methods: Focus
