Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models

Xu Li; Yi Zheng; Haotian Chen; Xiaolei Chen; Yuxuan Liang; Chenghang Lai; Bin Li; Xiangyang Xue

arXiv:2501.08443·cs.CV·January 20, 2025

Open Access

TL;DR

This paper introduces an instruction-guided fusion method for multi-layer visual features in large vision-language models, improving task-specific performance by dynamically integrating hierarchical features based on textual instructions.

Contribution

The paper systematically analyzes multi-layer visual features in LVLMs and proposes a dynamic, instruction-guided fusion module that enhances task-specific feature integration without increasing tokens.
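As a rough illustration of the idea, the sketch below weights per-layer visual features by scores derived from an instruction embedding, then sums them so the number of visual tokens stays the same. It is not the paper's implementation: the function names, the linear scoring matrix `w_proj`, the softmax weighting, and all shapes are assumptions made for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer_weights(instr_emb, w_proj):
    """Score each encoder layer against the instruction embedding.

    w_proj[l] is a hypothetical learned vector for layer l (an assumption,
    not taken from the paper); the dot product gives one logit per layer.
    """
    logits = [sum(a * b for a, b in zip(instr_emb, row)) for row in w_proj]
    return softmax(logits)


def fuse(layer_feats, weights):
    """Weighted sum over layers.

    layer_feats[l][n][d] holds feature d of token n from layer l.
    The result has shape [n][d], so the token count does not grow.
    """
    num_tokens = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    return [
        [
            sum(w * feats[n][d] for w, feats in zip(weights, layer_feats))
            for d in range(dim)
        ]
        for n in range(num_tokens)
    ]


# Toy data: 4 layers, 6 visual tokens, 8 feature dims, 5-dim instruction.
random.seed(0)
num_layers, num_tokens, dim, instr_dim = 4, 6, 8, 5
layer_feats = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_layers)
]
instr_emb = [random.gauss(0, 1) for _ in range(instr_dim)]
w_proj = [[random.gauss(0, 1) for _ in range(instr_dim)] for _ in range(num_layers)]

weights = layer_weights(instr_emb, w_proj)
fused = fuse(layer_feats, weights)

# Uniform fusion is the special case where every layer gets the same weight.
uniform_weights = layer_weights(instr_emb, [[0.0] * instr_dim] * num_layers)
uniform_fused = fuse(layer_feats, uniform_weights)
```

In this toy setup, different instruction embeddings yield different layer weights, while an all-zero scoring matrix reduces to the uniform fusion that the paper reports as suboptimal.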

Findings

1. Multilayer features offer complementary strengths across tasks.
2. Uniform fusion of features is suboptimal for diverse tasks.
3. The proposed method outperforms existing approaches on multiple benchmarks.

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable success in a wide range of multimodal tasks by integrating pre-trained vision encoders and large language models. However, current LVLMs primarily rely on visual features extracted from the final layers of the vision encoder, overlooking the complementary information available in shallower layers. While recent approaches have explored the use of multilayer visual features in LVLMs, they tend to be task-agnostic and fail to examine the dependencies of hierarchical visual features on specific tasks. To address these gaps, we systematically investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques