Dense Connector for MLLMs
Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang

TL;DR
This paper introduces the Dense Connector, a simple plug-and-play module that leverages multi-layer visual features to significantly enhance multimodal large language models, achieving state-of-the-art results with minimal additional computational cost.
Contribution
The paper proposes the Dense Connector and its efficient variant, which effectively utilize multi-layer visual features to improve MLLMs without substantial computational overhead.
Findings
Achieves state-of-the-art performance on 19 image and video benchmarks.
Enables zero-shot video understanding with models trained only on images.
Maintains high performance with only 25% of visual tokens.
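To make the idea of "multi-layer visual features" and "25% of visual tokens" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not say how the layers are combined or how tokens are reduced, so the sketch assumes channel-wise concatenation of features from several encoder layers and 2x2 average pooling over the token grid. The names `concat_layers` and `pool_2x2`, the choice of three layers, and the toy sizes are all hypothetical. The learned projection into the language model's embedding space that would follow in a real connector is omitted.

```python
from typing import List

Tokens = List[List[float]]  # one feature vector per visual token


def concat_layers(layer_features: List[Tokens]) -> Tokens:
    """Concatenate each token's features from several layers along the channel axis.

    Assumption: this is one plausible way to fuse multi-layer features.
    """
    num_tokens = len(layer_features[0])
    if any(len(layer) != num_tokens for layer in layer_features):
        raise ValueError("every layer must provide the same number of tokens")
    return [
        [value for layer in layer_features for value in layer[t]]
        for t in range(num_tokens)
    ]


def pool_2x2(tokens: Tokens, grid: int) -> Tokens:
    """Average each 2x2 block of a grid x grid token map, keeping 25% of the tokens.

    Assumption: average pooling stands in for whatever reduction the paper uses.
    """
    if grid % 2 != 0 or len(tokens) != grid * grid:
        raise ValueError("expected grid * grid tokens with an even grid size")
    dim = len(tokens[0])
    pooled = []
    for row in range(0, grid, 2):
        for col in range(0, grid, 2):
            block = [
                tokens[(row + dr) * grid + (col + dc)]
                for dr in (0, 1)
                for dc in (0, 1)
            ]
            pooled.append([sum(tok[c] for tok in block) / 4.0 for c in range(dim)])
    return pooled


# Toy example: 3 hypothetical encoder layers, a 4x4 token grid, 2 channels per token.
grid, dim = 4, 2
layers = [
    [[float(layer_id)] * dim for _ in range(grid * grid)]
    for layer_id in (1, 2, 3)
]
dense = concat_layers(layers)    # 16 tokens, 6 channels each
reduced = pool_2x2(dense, grid)  # 4 tokens, 6 channels each
print(len(dense), len(dense[0]), len(reduced), len(reduced[0]))
```

Under these assumptions, concatenation leaves the token count unchanged and only widens each token, so any extra cost falls on the projection step rather than on the language model's sequence length. Pooling then cuts the sequence to a quarter.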
Abstract
Do we fully leverage the potential of visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Building on this, we also propose…
Taxonomy
Topics: Photonic and Optical Devices · Ferroelectric and Negative Capacitance Devices · Analog and Mixed-Signal Circuit Design
Methods: Focus
