PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures
Tianxiang Wu, Minxin Nie, Ziqiang Cao

TL;DR
PIP-MM is a framework that pre-integrates prompt information into the visual encoding stage of MLLMs, so that the extracted visual features focus on prompt-relevant objects and carry less irrelevant information, which improves performance across benchmarks.
Contribution
The paper proposes a simple, trainable MLP module that incorporates prompt vectors into visual encoding. The module is applicable to any MLLM and improves both the task relevance of the visual features and efficiency.
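To make the idea concrete, here is a minimal NumPy sketch of how a prompt vector could be projected by an MLP and handed to an image encoder. The summary only states that a trainable MLP incorporates prompt vectors into visual encoding. Everything else here is an assumption made for illustration: the two-layer shape of the MLP, the toy dimensions, the choice to prepend the projected vector as an extra token in front of the patch embeddings, and the hypothetical names `PromptProjector` and `inject_prompt`. It is not the authors' implementation.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class PromptProjector:
    """Hypothetical two-layer MLP: maps a prompt vector (LLM width)
    to the width expected by the vision encoder."""

    def __init__(self, llm_dim, vis_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # In a real setup these would be the trainable parameters.
        self.w1 = rng.normal(0.0, 0.02, size=(llm_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, vis_dim))
        self.b2 = np.zeros(vis_dim)

    def __call__(self, prompt_vec):
        hidden = gelu(prompt_vec @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


def inject_prompt(patch_embeds, prompt_vec, projector):
    """Assumed injection point: prepend the projected prompt vector as one
    extra token, so the encoder's attention layers can condition on it."""
    prompt_token = projector(prompt_vec)[None, :]  # shape (1, vis_dim)
    return np.concatenate([prompt_token, patch_embeds], axis=0)


# Toy sizes, chosen only for the example.
rng = np.random.default_rng(1)
patch_embeds = rng.normal(size=(16, 64))  # 16 image patches, vision width 64
prompt_vec = rng.normal(size=128)         # one prompt vector, LLM width 128

projector = PromptProjector(llm_dim=128, vis_dim=64, hidden_dim=256)
encoder_input = inject_prompt(patch_embeds, prompt_vec, projector)
print(encoder_input.shape)  # (17, 64): one prompt token plus 16 patch tokens
```

In this sketch only the projector would be trained, which matches the summary's description of a small add-on module. How the prompt vector is obtained and where exactly it enters the encoder are details to check against the paper itself.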
Findings
Enhanced performance on multiple benchmarks.
Maintains high-quality generation with fewer visual tokens.
Effective prompt integration with minimal additional training.
Abstract
The Multimodal Large Language Models (MLLMs) have activated the capabilities of Large Language Models (LLMs) in solving visual-language tasks by integrating visual information. The prevailing approach in existing MLLMs involves employing an image encoder to extract visual features, converting these features into visual tokens via an adapter, and then integrating them with the prompt into the LLM. However, because the process of image encoding is prompt-agnostic, the extracted visual features only provide a coarse description of the image, impossible to focus on the requirements of the prompt. On one hand, it is easy for image features to lack information about the prompt-specified objects, resulting in unsatisfactory responses. On the other hand, the visual features contain a large amount of irrelevant information, which not only increases the burden on memory but also worsens the generation…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: ALIGN · Focus
