Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
Yue Zhang, Hehe Fan, Yi Yang

TL;DR
This paper introduces prompt-aware adapters that dynamically embed visual inputs based on prompt focus, improving multimodal large language models' understanding of complex visual scenes in tasks like visual question answering.
Contribution
The paper proposes a novel prompt-aware adapter design that adaptively focuses on relevant visual features guided by prompt information, enhancing multimodal model performance.
Findings
Improved accuracy on visual question answering tasks.
Enhanced focus on relevant visual regions based on prompts.
Better handling of complex scenes with diverse visual details.
Abstract
To bridge the gap between vision and language modalities, Multimodal Large Language Models (MLLMs) usually learn an adapter that converts visual inputs to understandable tokens for Large Language Models (LLMs). However, most adapters generate consistent visual tokens, regardless of the specific objects of interest mentioned in the prompt. Since these adapters distribute equal attention to every detail in the image and focus on the entire scene, they may increase the cognitive load for LLMs, particularly when processing complex scenes. To alleviate this problem, we propose prompt-aware adapters. These adapters are designed with the capability to dynamically embed visual inputs based on the specific focus of the prompt. Specifically, prompt-aware adapters utilize both global and local textual features to capture the most relevant visual clues from the prompt at both coarse and fine…
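The abstract is truncated here, so the following is only a toy sketch of the general idea it describes: the same image yields different visual tokens depending on the prompt, using a coarse (global) and a fine (local) textual signal. Everything in it is an assumption made for illustration, not the authors' architecture: the function names, the mean-pooled global prompt vector, the softmax gating, and the choice to emit one adapted token per prompt word are all hypothetical.

```python
import math

# Toy sketch only. All names and design choices below are hypothetical
# illustrations, not the adapter described in the paper.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def global_gate(visual, prompt_global):
    """Coarse step (assumed): reweight each visual token by its similarity
    to a single prompt-level vector."""
    scale = math.sqrt(len(prompt_global))
    weights = softmax([dot(v, prompt_global) / scale for v in visual])
    n = len(visual)
    # Multiply by n so that uniform weights would leave the tokens unchanged.
    return [[w * n * x for x in v] for v, w in zip(visual, weights)]

def local_attention(visual, prompt_words):
    """Fine step (assumed): each prompt word attends over the visual tokens
    and produces one adapted token."""
    scale = math.sqrt(len(prompt_words[0]))
    dim = len(visual[0])
    adapted = []
    for q in prompt_words:
        attn = softmax([dot(q, v) / scale for v in visual])
        adapted.append([sum(a * v[d] for a, v in zip(attn, visual))
                        for d in range(dim)])
    return adapted

def prompt_aware_adapter(visual, prompt_words):
    """Chain the coarse and fine steps; the global prompt vector is the
    mean of the word vectors (an assumption)."""
    dim = len(prompt_words[0])
    prompt_global = [sum(w[d] for w in prompt_words) / len(prompt_words)
                     for d in range(dim)]
    return local_attention(global_gate(visual, prompt_global), prompt_words)

# Two visual tokens, each aligned with a different feature axis.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
tokens_for_prompt_a = prompt_aware_adapter(visual_tokens, [[4.0, 0.0]])
tokens_for_prompt_b = prompt_aware_adapter(visual_tokens, [[0.0, 4.0]])
```

In this sketch the two prompts pull the adapted token toward different visual tokens, which is the contrast with a prompt-independent adapter that would return the same output for both.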
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Adapter · Focus
