Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation

Xiaofeng Zhang; Fanshuo Zeng; Yihao Quan; Zheng Hui; Jiawei Yao

arXiv:2412.09817·cs.CV·December 16, 2024

TL;DR

This paper introduces Simignore, a method that enhances multimodal large language models' complex reasoning by filtering irrelevant image tokens based on similarity to text, improving interpretability and reasoning performance.

Contribution

The paper proposes a novel image token reduction technique, Simignore, that leverages similarity computation to improve complex reasoning in multimodal large language models.

Findings

01 Simignore improves reasoning accuracy on complex tasks

02 Filtering irrelevant image tokens enhances model interpretability

03 The method is validated through extensive experiments

Abstract

Multimodal large language models have developed rapidly, and many different models have emerged. The interpretability of large vision-language models (LVLMs), however, remains under-explored. Especially on more complex tasks such as chain-of-thought reasoning, their internal mechanisms still resemble a black box that is difficult to decipher. By studying the interaction and information flow between images and text, we noticed that in models such as LLaVA1.5, image tokens that are semantically related to the text are more likely to show information flow convergence in the LLM decoding layer, and these image tokens receive higher attention scores. In contrast, image tokens that are less relevant to the text show no information flow convergence and receive only very small attention scores. To use the image information efficiently, we propose a new image token reduction method, Simignore,…
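The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity, the max-over-text-tokens scoring rule, the `keep_ratio` parameter, and the function names are all assumptions made for illustration, and the embeddings are plain Python lists rather than real model activations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_similar_image_tokens(image_tokens, text_tokens, keep_ratio=0.5):
    """Return the indices (in original order) of image tokens to keep.

    Assumption: each image token is scored by its highest cosine similarity
    to any text token, and the top `keep_ratio` fraction is retained.
    The paper may use a different similarity measure or selection rule.
    """
    if not image_tokens or not text_tokens:
        # Nothing to compare against, so keep every image token.
        return list(range(len(image_tokens)))

    scores = [
        max(cosine(img, txt) for txt in text_tokens)
        for img in image_tokens
    ]
    n_keep = max(1, int(keep_ratio * len(image_tokens)))
    ranked = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(ranked[:n_keep])


# Toy example: four 2-D "image tokens" and one "text token".
image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text_tokens = [[1.0, 0.0]]
kept = keep_similar_image_tokens(image_tokens, text_tokens, keep_ratio=0.5)
```

In this toy setup the two image tokens pointing in roughly the same direction as the text token are kept and the other two are ignored. In a real model the vectors would be the projected image embeddings and the text embeddings fed to the LLM, and the kept indices would be used to mask or drop the remaining image tokens.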

Code & Models

Repositories

fanshuozeng/simignore · jax · Official

Taxonomy

Methods: Softmax · Attention Is All You Need