TL;DR
This paper introduces Manager, a plugin for Two-Tower VLMs and MLLMs that adaptively aggregates insights from unimodal experts, improving performance across multiple downstream vision-language tasks and datasets.
Contribution
The paper proposes Manager, a novel plugin that enhances Two-Tower VLMs and MLLMs by adaptively aggregating unimodal insights, improving alignment and fusion in vision-language models.
Findings
ManagerTower outperforms previous baselines on 4 VL tasks.
LLaVA-OV-Manager boosts zero-shot performance across 20 datasets.
The plugin captures diverse visual details, mitigating semantic ambiguity.
Abstract
Two-Tower Vision-Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it (i) suffers from ineffective layer-by-layer utilization of unimodal representations, (ii) restricts the flexible exploitation of different levels of unimodal semantic knowledge, and (iii) is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer.…
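To make "adaptively aggregates insights from different levels of pre-trained unimodal experts" concrete, here is a minimal NumPy sketch of one plausible reading: hidden states from several encoder layers are combined with softmax weights computed from a query vector. This is an illustration under stated assumptions, not the paper's implementation; the function name `aggregate_layers`, the gating matrix `w_gate`, and the use of a single query vector are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_layers(layer_states, query, w_gate):
    """Weighted sum of per-layer hidden states (hypothetical sketch).

    layer_states: (num_layers, seq_len, dim) outputs of a unimodal encoder
    query:        (dim,) vector the weights are conditioned on (assumption)
    w_gate:       (dim, num_layers) gating projection (assumption)
    """
    logits = query @ w_gate        # one score per layer
    weights = softmax(logits)      # non-negative, sums to 1
    # Contract the layer axis: (num_layers,) x (num_layers, seq_len, dim)
    fused = np.tensordot(weights, layer_states, axes=(0, 0))
    return fused, weights


# Toy usage with random data standing in for encoder outputs.
rng = np.random.default_rng(0)
num_layers, seq_len, dim = 6, 4, 8
layer_states = rng.normal(size=(num_layers, seq_len, dim))
query = rng.normal(size=dim)
w_gate = rng.normal(size=(dim, num_layers))

fused, weights = aggregate_layers(layer_states, query, w_gate)
```

Because the weights depend on the query, different inputs can emphasize different layers. With an all-zero `w_gate` the weights become uniform and the result reduces to a plain average over layers.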
