Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Weifeng Lin; Xinyu Wei; Ruichuan An; Peng Gao; Bocheng Zou; Yulin Luo; Siyuan Huang; Shanghang Zhang; Hongsheng Li

arXiv:2403.20271·cs.CV·February 25, 2025·1 cites

TL;DR

This paper introduces the Draw-and-Understand framework that enhances Multimodal Large Language Models with visual prompting understanding, enabling fine-grained image comprehension and multi-modal interaction through a new dataset and benchmark.

Contribution

The paper proposes a general architecture for integrating visual prompts into MLLMs, introduces a large multi-domain dataset (MDVP-Instruct-Data), and creates a benchmark (MDVP-Bench) for evaluating visual prompt understanding.

Findings

01 Models trained with the dataset show improved multimodal interaction capabilities.

02 The framework recognizes various types of visual prompts, such as points, bounding boxes, and free-form shapes.

03 The framework maintains strong image-level perception while enhancing fine-grained understanding.

Abstract

In this paper, we present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts (such as points, bounding boxes, and free-form shapes) alongside language understanding. Additionally, we introduce MDVP-Instruct-Data, a multi-domain dataset featuring 1.2 million image-visual prompt-text triplets, including natural images, document images, scene text images, mobile/web screenshots, and remote sensing images. Building on this dataset, we introduce MDVP-Bench, a challenging benchmark designed to…
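The image-visual prompt-text triplets described in the abstract can be pictured with a small data-structure sketch. This is an illustration only, not the actual schema of MDVP-Instruct-Data: the class names, field names, the helper functions `bounding_box` and `normalize`, and the file name `example.jpg` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical prompt types mirroring the three kinds named in the abstract.
@dataclass(frozen=True)
class PointPrompt:
    x: float
    y: float


@dataclass(frozen=True)
class BoxPrompt:
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass(frozen=True)
class FreeFormPrompt:
    # Polygon vertices approximating a free-form shape drawn by the user.
    vertices: Tuple[Tuple[float, float], ...]


VisualPrompt = Union[PointPrompt, BoxPrompt, FreeFormPrompt]


@dataclass(frozen=True)
class Triplet:
    """One image-visual prompt-text sample (assumed layout, not the dataset's)."""
    image_path: str
    prompts: Tuple[VisualPrompt, ...]
    text: str


def bounding_box(prompt: VisualPrompt) -> BoxPrompt:
    """Reduce any prompt type to the axis-aligned box that encloses it."""
    if isinstance(prompt, PointPrompt):
        return BoxPrompt(prompt.x, prompt.y, prompt.x, prompt.y)
    if isinstance(prompt, BoxPrompt):
        return prompt
    xs = [x for x, _ in prompt.vertices]
    ys = [y for _, y in prompt.vertices]
    return BoxPrompt(min(xs), min(ys), max(xs), max(ys))


def normalize(box: BoxPrompt, width: int, height: int) -> BoxPrompt:
    """Scale pixel coordinates to the [0, 1] range for a given image size."""
    return BoxPrompt(box.x_min / width, box.y_min / height,
                     box.x_max / width, box.y_max / height)


# Hypothetical sample: a free-form region plus a question about it.
sample = Triplet(
    image_path="example.jpg",
    prompts=(FreeFormPrompt(((10, 20), (30, 5), (25, 40))),),
    text="What is the object inside the marked region?",
)
enclosing = bounding_box(sample.prompts[0])
print(normalize(enclosing, width=100, height=50))
```

Reducing every prompt type to an enclosing box is just one convenient way to compare prompts of different kinds; the paper's architecture may encode points, boxes, and free-form shapes differently.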

Code & Models

Repositories

AFeng-x/Draw-and-Understand
PyTorch · Official

Models

Afeng-x/SPHINX-V-Model
model · ♡ 3

Datasets

Afeng-x/Draw-and-Understand
dataset · 169 dl

Taxonomy

Topics: Digital Accessibility for Disabilities

Methods: Focus