PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures

Tianxiang Wu; Minxin Nie; Ziqiang Cao

arXiv:2410.23089 · cs.CV · October 31, 2024

Open Access

TL;DR

PIP-MM is a framework that pre-integrates prompt information into the visual encoding stage of MLLMs, so that the extracted visual features focus on prompt-relevant objects and carry less irrelevant information. The authors report improved performance across benchmarks.

Contribution

The paper proposes a simple, trainable MLP module that incorporates prompt vectors into visual encoding. The module is applicable to any MLLM and is intended to make the visual features more relevant to the task and more efficient to use.
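A minimal sketch of this idea, in plain Python with toy sizes. Several details are assumptions not stated on this page: the prompt vector is a single pooled LLM hidden state, the MLP has two layers with a ReLU, and the projected vector is written into the first (class-token) slot of the image encoder's input sequence. All names (`project_prompt`, `inject_prompt`, `LLM_DIM`, `VIT_DIM`) are hypothetical, and the weights are random rather than trained.

```python
import math
import random


def make_linear(rng, n_in, n_out):
    """Random weight matrix (n_out rows, n_in columns) and a zero bias."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def project_prompt(prompt_vec, mlp):
    """Two-layer MLP: LLM hidden size -> image encoder width (assumed shape)."""
    return linear(relu(linear(prompt_vec, mlp[0])), mlp[1])


def inject_prompt(encoder_input, prompt_vec, mlp):
    """Copy the encoder input, replacing slot 0 with the projected prompt.

    Using slot 0 (the class-token position) is an assumption of this sketch.
    """
    return [project_prompt(prompt_vec, mlp)] + [list(t) for t in encoder_input[1:]]


rng = random.Random(0)
LLM_DIM, VIT_DIM, NUM_PATCHES = 8, 4, 6  # toy sizes, not from the paper

mlp = (make_linear(rng, LLM_DIM, 16), make_linear(rng, 16, VIT_DIM))
prompt_vec = [rng.gauss(0, 1) for _ in range(LLM_DIM)]
# One class-token slot followed by NUM_PATCHES patch embeddings.
encoder_input = [[rng.gauss(0, 1) for _ in range(VIT_DIM)]
                 for _ in range(1 + NUM_PATCHES)]

conditioned = inject_prompt(encoder_input, prompt_vec, mlp)
print(len(conditioned), len(conditioned[0]))  # 7 4
```

In this sketch only the MLP would need training; the patch embeddings pass through unchanged, so the image encoder sees the prompt through a single input slot.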

Findings

01. Enhanced performance on multiple benchmarks.

02. Maintains high-quality generation with fewer visual tokens.

03. Effective prompt integration with minimal additional training.

Abstract

Multimodal Large Language Models (MLLMs) have activated the capabilities of Large Language Models (LLMs) in solving visual-language tasks by integrating visual information. The prevailing approach in existing MLLMs involves employing an image encoder to extract visual features, converting these features into visual tokens via an adapter, and then integrating them with the prompt into the LLM. However, because the process of image encoding is prompt-agnostic, the extracted visual features only provide a coarse description of the image and cannot focus on the requirements of the prompt. On one hand, it is easy for image features to lack information about the prompt-specified objects, resulting in unsatisfactory responses. On the other hand, the visual features contain a large amount of irrelevant information, which not only increases the burden on memory but also worsens the generation…
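The prompt-agnostic baseline pipeline described in the abstract can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: `encode_image` averages pixel pairs instead of running a real image encoder, `adapter` is a single scaling step, and tokens are tagged tuples rather than embeddings. The point is only that the visual tokens are computed before the prompt is consulted, so they are identical for every prompt.

```python
def encode_image(pixels, patch_size=2):
    """Toy image encoder: mean of each non-overlapping patch of a 1-D pixel list."""
    return [sum(pixels[i:i + patch_size]) / patch_size
            for i in range(0, len(pixels), patch_size)]


def adapter(features, scale=2.0):
    """Toy adapter: turn each visual feature into one 'visual token' value."""
    return [scale * f for f in features]


def build_llm_input(pixels, prompt):
    """Concatenate visual tokens with prompt tokens, as in the baseline pipeline."""
    # The prompt plays no role in computing the visual tokens.
    visual_tokens = [("img", t) for t in adapter(encode_image(pixels))]
    text_tokens = [("txt", word) for word in prompt.split()]
    return visual_tokens + text_tokens


pixels = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
first = build_llm_input(pixels, "what color is the car")
second = build_llm_input(pixels, "count the people")

# The image part of the LLM input is the same regardless of the prompt.
print(first[:3] == second[:3])  # True
```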

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: ALIGN · Focus