LMEye: An Interactive Perception Network for Large Language Models
Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, and Min Zhang

TL;DR
LMEye introduces an interactive perception network that enables large language models to dynamically request and interact with visual information based on human queries, significantly enhancing multimodal task performance.
Contribution
The paper presents LMEye, a novel interactive perception network allowing LLMs to request and interact with visual information dynamically, improving multimodal understanding and response accuracy.
Findings
Significantly improves zero-shot multimodal task performance
Achieves better results with fewer parameters
Enables dynamic visual information interaction based on human queries
Abstract
Training a Multimodal Large Language Model (MLLM) from scratch, like GPT-4, is resource-intensive. Regarding Large Language Models (LLMs) as the core processor for multimodal information, our paper introduces LMEye, a human-like eye with a play-and-plug interactive perception network, designed to enable dynamic interaction between LLMs and external vision information. Previous methods incorporate visual information into LLMs with a simple visual mapping network or Q-former from BLIP-2. Such networks project the image feature once yet do not consider the interaction between the image and the human input query. Hence, the obtained visual information without being connected to human intention may be inadequate for LLMs to generate intention-following responses, which we refer to as static visual information. LMEye addresses this issue by allowing the LLM to request the desired visual…
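The contrast the abstract draws between static and query-dependent visual information can be illustrated with a minimal sketch. The NumPy code below is not the authors' implementation: the function names, the single-head attention form, the random weights, and the dimensions are all assumptions made for illustration. It places a one-shot projection that ignores the query next to a cross-attention step in which a request vector (a stand-in for whatever signal the LLM emits) reweights the image patch features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def static_projection(image_feats, w_proj):
    """Project patch features once; the human query plays no role."""
    return image_feats @ w_proj


def query_conditioned_features(image_feats, request_vec, w_q, w_k, w_v):
    """Single-head cross-attention: the request vector attends over patches.

    image_feats: (num_patches, dim), request_vec: (dim,)
    Returns the attended feature (dim,) and the attention weights (num_patches,).
    """
    q = request_vec @ w_q                      # query derived from the request
    k = image_feats @ w_k                      # one key per image patch
    v = image_feats @ w_v                      # one value per image patch
    scores = k @ q / np.sqrt(q.shape[-1])      # scaled dot-product scores
    weights = softmax(scores)                  # distribution over patches
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_patches = 8, 4                    # hypothetical sizes
    image_feats = rng.normal(size=(num_patches, dim))
    w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

    # Two different requests, e.g. derived from two different questions.
    request_a = rng.normal(size=dim)
    request_b = rng.normal(size=dim)

    out_a, attn_a = query_conditioned_features(image_feats, request_a, w_q, w_k, w_v)
    out_b, attn_b = query_conditioned_features(image_feats, request_b, w_q, w_k, w_v)

    print("attention for request A:", np.round(attn_a, 3))
    print("attention for request B:", np.round(attn_b, 3))
    print("static output shape:", static_projection(image_feats, w_v).shape)
```

In this sketch the static projection returns the same features whatever the question is, while the attended feature changes with the request vector. That is the distinction the abstract describes, reduced to its simplest form.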
Peer Reviews
Decision · Submitted to ICLR 2024
- The paper presents a variety of evaluation benchmarks and shows the potential of LMEye.
- Extending the capabilities of multimodal models by training lightweight modules is an interesting direction, also proposed by previous work (e.g., MiniGPT-4).
- Figure 1 could be improved a lot. First, I suggest including and labeling each component of LMEye. This would help the reader connect the notation introduced in Section 3.1 with the flow of the figure.
- Ablations.
- It would be helpful to understand the contribution of the RVII module if the authors included ablations with and without that module on the MME and SEED-Bench benchmarks.
- I also suggest specifying the Vision Model and Language Model with their corresponding parameters t
+ The overall motivation of the paper is sound. The authors add dynamic attention to better summarize the information on the visual feature map.
+ The presentation is mostly clear.
+ The experimental results support the claims.
Overall, the paper lacks a significant enough contribution to the vision and language community:
- The so-called "interactive perception model" is actually a standard and common technique used everywhere. Ever since attention was proposed, visual information has been dynamically summarized (attended). Some older works on VQA, such as NMN, have extensively used instruction-based conditioning to process visual information.
- The network/module themselves are also pretty common, and do not convey any significa
- The motivation and implementation are clear and straightforward. The visual information should interact with language queries in multimodal LLMs. Current MLLMs do not point out this issue, but I think it is important for a deeper understanding of the visual signal.
- The evaluation results on multiple tasks are promising. I believe LMEye can serve as a strong baseline for future work.
- A new VQA dataset with long answers. Traditional VQA datasets are not suitable for current MLLMs since their answers
- Although the RVII module is the core contribution of the paper, the authors did not carefully explore its contribution, for example its impact under the same data and training process.
- In my view, there is no essential difference between RVII and LoRA. LoRA inserts some parameters inside the LLM, while RVII is more like an Adapter module outside the LLM. The authors claim that the visual feature from RVII is dynamic and conditioned on human queries. Since the decoding process o
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Layer Normalization · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Attention Is All You Need · Absolute Position Encodings · Dense Connections · Adam
