Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models
Xu Li, Yi Zheng, Haotian Chen, Xiaolei Chen, Yuxuan Liang, Chenghang Lai, Bin Li, Xiangyang Xue

TL;DR
This paper introduces an instruction-guided fusion method for multi-layer visual features in large vision-language models, improving task-specific performance by dynamically integrating hierarchical features based on textual instructions.
Contribution
The paper systematically analyzes multi-layer visual features in LVLMs and proposes a dynamic, instruction-guided fusion module that enhances task-specific feature integration without increasing tokens.
Findings
Multilayer features offer complementary strengths across tasks.
Uniform fusion of features is suboptimal for diverse tasks.
The proposed method outperforms existing approaches on multiple benchmarks.
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable success in a wide range of multimodal tasks by integrating pre-trained vision encoders and large language models. However, current LVLMs primarily rely on visual features extracted from the final layers of the vision encoder, overlooking the complementary information available in shallower layers. While recent approaches have explored the use of multilayer visual features in LVLMs, they tend to be task-agnostic and fail to examine the dependencies of hierarchical visual features on specific tasks. To address these gaps, we systematically investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.…
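The contrast between uniform fusion and instruction-dependent fusion can be illustrated with a toy sketch. This is not the paper's module: the dot-product scoring rule, the `layer_keys` vectors, and all function names below are hypothetical, chosen only to show how per-layer weights derived from an instruction embedding can combine multi-layer features while leaving the number of visual tokens unchanged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_weights(instruction_vec, layer_keys):
    """Score each layer against the instruction embedding.

    Assumption: a plain dot product stands in for whatever scoring
    the real model learns; `layer_keys` is a hypothetical name.
    """
    scores = [
        sum(a * b for a, b in zip(instruction_vec, key))
        for key in layer_keys
    ]
    return softmax(scores)


def fuse_layers(layer_features, weights):
    """Weighted sum across layers.

    Each layer holds the same tokens, so the fused output has exactly
    as many tokens as a single layer (no extra tokens are created).
    """
    num_tokens = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for feats, w in zip(layer_features, weights):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += w * feats[t][d]
    return fused


# Toy data: 3 layers (shallow, middle, deep), 2 tokens, 2 dimensions.
layer_features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0, 4.0], [4.0, 0.0]],
]
layer_keys = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
instruction_vec = [2.0, 0.0]  # made-up embedding that favors the shallow layer

weights = layer_weights(instruction_vec, layer_keys)
fused = fuse_layers(layer_features, weights)

# Uniform fusion for comparison: every layer gets the same weight.
uniform = fuse_layers(layer_features, [1.0 / 3.0] * 3)
```

In this sketch a different `instruction_vec` shifts the weights toward other layers, whereas the uniform baseline ignores the instruction entirely. In both cases the fused output keeps the token count of a single layer.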
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
