Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models

Yue Zhang; Hehe Fan; Yi Yang

arXiv:2405.15684 · cs.CV · May 27, 2024 · 2 cites

Open Access

TL;DR

This paper introduces prompt-aware adapters that dynamically embed visual inputs based on prompt focus, improving multimodal large language models' understanding of complex visual scenes in tasks like visual question answering.

Contribution

The paper proposes a novel prompt-aware adapter design that adaptively focuses on relevant visual features guided by prompt information, enhancing multimodal model performance.

Findings

01 Improved accuracy on visual question answering tasks.

02 Enhanced focus on relevant visual regions based on prompts.

03 Better handling of complex scenes with diverse visual details.

Abstract

To bridge the gap between vision and language modalities, Multimodal Large Language Models (MLLMs) usually learn an adapter that converts visual inputs to understandable tokens for Large Language Models (LLMs). However, most adapters generate consistent visual tokens, regardless of the specific objects of interest mentioned in the prompt. Since these adapters distribute equal attention to every detail in the image and focus on the entire scene, they may increase the cognitive load for LLMs, particularly when processing complex scenes. To alleviate this problem, we propose prompt-aware adapters. These adapters are designed with the capability to dynamically embed visual inputs based on the specific focus of the prompt. Specifically, prompt-aware adapters utilize both global and local textual features to capture the most relevant visual clues from the prompt at both coarse and fine…
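
The idea of embedding visual inputs according to the prompt can be illustrated with a small sketch. It is not the authors' implementation: the abstract is truncated here and does not specify the architecture. The sketch assumes the global textual feature is the mean of the word embeddings, that the coarse step reweights visual tokens by their similarity to that vector, and that the fine step lets each visual token attend over the individual words. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def global_weights(visual_tokens, word_embeddings):
    """Coarse step (assumed): score each visual token against one prompt-level vector."""
    prompt_vec = mean_vector(word_embeddings)  # assumed global textual feature
    scale = math.sqrt(len(prompt_vec))
    return softmax([dot(v, prompt_vec) / scale for v in visual_tokens])


def local_context(visual_token, word_embeddings):
    """Fine step (assumed): one visual token attends over the individual prompt words."""
    scale = math.sqrt(len(visual_token))
    attn = softmax([dot(visual_token, w) / scale for w in word_embeddings])
    dim = len(visual_token)
    return [sum(a * w[i] for a, w in zip(attn, word_embeddings)) for i in range(dim)]


def prompt_aware_tokens(visual_tokens, word_embeddings):
    """Return visual tokens conditioned on the prompt (illustrative only)."""
    weights = global_weights(visual_tokens, word_embeddings)
    n = len(visual_tokens)
    out = []
    for v, g in zip(visual_tokens, weights):
        ctx = local_context(v, word_embeddings)
        # n * g equals 1 when the weights are uniform, so an uninformative
        # prompt leaves the visual part of the token unscaled.
        out.append([n * g * x + c for x, c in zip(v, ctx)])
    return out


if __name__ == "__main__":
    image = [[1.0, 0.0], [0.0, 1.0]]   # two toy visual tokens
    prompt_a = [[2.0, 0.0]]            # prompt aligned with the first token
    prompt_b = [[0.0, 2.0]]            # prompt aligned with the second token
    print(global_weights(image, prompt_a))
    print(global_weights(image, prompt_b))
```

In this toy setup the same two visual tokens receive different weights under the two prompts, which is the property the abstract contrasts with adapters that produce the same visual tokens regardless of the prompt.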

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Adapter · Focus