LMEye: An Interactive Perception Network for Large Language Models

Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, and Min Zhang

arXiv:2305.03701·cs.CV·September 29, 2023·6 cites

TL;DR

LMEye introduces an interactive perception network that enables large language models to dynamically request and interact with visual information based on human queries, significantly enhancing multimodal task performance.

Contribution

The paper presents LMEye, a novel interactive perception network allowing LLMs to request and interact with visual information dynamically, improving multimodal understanding and response accuracy.

Findings

01

Significantly improves zero-shot multimodal task performance

02

Achieves better results with fewer parameters

03

Enables dynamic visual information interaction based on human queries

Abstract

Training a Multimodal Large Language Model (MLLM) from scratch, like GPT-4, is resource-intensive. Regarding Large Language Models (LLMs) as the core processor for multimodal information, our paper introduces LMEye, a human-like eye with a play-and-plug interactive perception network, designed to enable dynamic interaction between LLMs and external vision information. Previous methods incorporate visual information into LLMs with a simple visual mapping network or Q-former from BLIP-2. Such networks project the image feature once yet do not consider the interaction between the image and the human input query. Hence, the obtained visual information without being connected to human intention may be inadequate for LLMs to generate intention-following responses, which we refer to as static visual information. LMEye addresses this issue by allowing the LLM to request the desired visual…
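The contrast the abstract draws between "static" visual information and query-dependent visual information can be illustrated with a toy example. The sketch below is not the paper's architecture: it swaps in plain scaled dot-product attention pooling as a stand-in, and the names `static_visual_summary`, `query_conditioned_summary`, and `request_vector` are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def static_visual_summary(patch_features):
    """Query-independent summary: the mean over image patches.

    The result is identical for every human query, which mirrors the
    'project the image feature once' behaviour described in the abstract.
    """
    count = len(patch_features)
    dim = len(patch_features[0])
    return [sum(p[i] for p in patch_features) / count for i in range(dim)]


def query_conditioned_summary(patch_features, request_vector):
    """Attention-weighted summary whose weights depend on the request.

    `request_vector` is a hypothetical stand-in for whatever signal the
    language model emits to ask for visual information (an assumption).
    """
    scale = math.sqrt(len(request_vector))
    weights = softmax([dot(p, request_vector) / scale for p in patch_features])
    dim = len(patch_features[0])
    return [sum(w * p[i] for w, p in zip(weights, patch_features))
            for i in range(dim)]


# Toy data: two image patches with orthogonal features (made up).
patches = [[1.0, 0.0], [0.0, 1.0]]
request_a = [4.0, 0.0]  # a query that "cares about" the first patch
request_b = [0.0, 4.0]  # a query that "cares about" the second patch

static = static_visual_summary(patches)
summary_a = query_conditioned_summary(patches, request_a)
summary_b = query_conditioned_summary(patches, request_b)
```

In this toy setup `static` is the same regardless of the query, while `summary_a` and `summary_b` lean toward different patches, which is the behaviour the abstract argues a simple one-shot projection lacks.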

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

- The paper presents a variety of evaluation benchmarks and shows the potential of LMEye.
- Extending the capabilities of multimodal models by training lightweight modules is an interesting direction, also proposed by previous work (e.g., MiniGPT-4).

Weaknesses

- Figure 1 could be improved a lot. First, I suggest including and labeling each component of LMEye. This will help the reader connect the notation in Section 3.1 with the flow of the figure.
- Ablations.
  - It would be beneficial to understand the contribution of the RVII module if the authors included ablations with and without that module on the MME and SEED-Bench benchmarks.
  - I also suggest specifying the Vision Model and Language Model with their corresponding parameters t

Reviewer 02 · Rating 3 · reject, not good enough · Confidence 4

Strengths

+ The overall motivation of the paper is sound. The authors add dynamic attention to better summarize the information on the visual feature map.
+ The presentation is mostly clear.
+ The experimental results support the claims.

Weaknesses

Overall, the paper lacks a significant enough contribution to the vision-and-language community:
- The so-called "interactive perception model" is actually a standard and common technique used everywhere. Ever since attention was proposed, visual information has been dynamically summarized (attended). Some older works on VQA, such as NMN, have extensively used instruction-based conditioning to process visual information.
- The network/module themselves are also pretty common, and do not convey any significa

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

- The motivation and implementation are clear and straightforward. The visual information should interact with language queries in multimodal LLMs. Current MLLMs do not point out this issue, but I think it is important for a deeper understanding of the visual signal.
- The evaluation on multiple tasks is promising. I believe LMEye can serve as a strong baseline for future work.
- A new VQA dataset with long answers. Traditional VQA datasets are not suitable for current MLLMs since their answers

Weaknesses

- Although the RVII module is the core contribution of the paper, the authors did not carefully explore its contribution, for example its impact under the same data and training process.
- In my view, there is no essential difference between RVII and LoRA. LoRA inserts some parameters inside the LLM, while RVII is more like an adapter module outside the LLM. The authors claim that the visual feature from RVII is dynamic and conditioned on human queries. Since the decoding process o

Code & Models

Repositories

yunxinli/lingcloud
PyTorch · Official

Datasets

YunxinLi/Multimodal_Instruction_data_v1
dataset· 30 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Layer Normalization · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Attention Is All You Need · Absolute Position Encodings · Dense Connections · Adam