EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models

Jiacheng Zhang; Yang Jiao; Shaoxiang Chen; Jingjing Chen; Yu-Gang Jiang

arXiv:2409.16723·cs.CV·September 27, 2024

TL;DR

EAGLE is an approach that lets multimodal large language models understand arbitrary visual prompts efficiently by embedding the prompts as spatial concepts, which reduces training effort and improves generalization.

Contribution

The paper presents EAGLE, a new MLLM that comprehends diverse visual prompts using spatial concepts, and introduces a Geometry-Agnostic Learning paradigm (GAL) to improve generalization across prompt formats.

Findings

01

EAGLE outperforms existing methods in understanding arbitrary visual prompts.

02

The approach requires less training effort than prior methods that rely on specialized encoding modules.

03

Extensive experiments validate the effectiveness of EAGLE and GAL.

Abstract

Recently, Multimodal Large Language Models (MLLMs) have sparked great research interest owing to their exceptional content-reasoning and instruction-following capabilities. To effectively instruct an MLLM, in addition to conventional language expressions, the practice of referring to objects by painting with brushes on images has emerged as a prevalent tool (referred to as "referring visual prompts") due to its efficacy in aligning the user's intention with specific image regions. To accommodate the most common referring visual prompts, namely points, boxes, and masks, existing approaches first use specialized feature encoding modules to capture the semantics of the highlighted areas indicated by these prompts. These encoded region features are then adapted to MLLMs through fine-tuning on a meticulously curated multimodal instruction dataset. However, such designs…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling