Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

Xiao Xu; Libo Qin; Wanxiang Che; Min-Yen Kan

arXiv:2506.11515·cs.CV·June 16, 2025

TL;DR

This paper introduces Manager, a plugin for Two-Tower VLMs and MLLMs that adaptively aggregates insights from unimodal experts, improving performance across multiple downstream vision-language tasks and datasets.

Contribution

The paper proposes Manager, a novel plugin that enhances Two-Tower VLMs and MLLMs by adaptively aggregating unimodal insights, improving alignment and fusion in vision-language models.

Findings

01. ManagerTower outperforms previous baselines on 4 VL tasks.

02. LLaVA-OV-Manager boosts zero-shot performance across 20 datasets.

03. The plugin captures diverse visual details, mitigating semantic ambiguity.

Abstract

Two-Tower Vision-Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it (i) suffers from ineffective layer-by-layer utilization of unimodal representations, (ii) restricts the flexible exploitation of different levels of unimodal semantic knowledge, and (iii) has been evaluated only on traditional low-resolution datasets and only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient, and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that places a manager in each cross-modal layer.…
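
To make the idea of aggregating different levels of unimodal representations concrete, here is a minimal sketch in plain Python. It is not the paper's manager: the abstract does not specify the mechanism, so this assumes the simplest possible form, a softmax-weighted sum over per-layer feature vectors. The names `aggregate_layers` and `layer_scores` are hypothetical, and in a real model the scores would presumably be learned or conditioned on the input rather than passed in by hand.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(
    layer_outputs: Sequence[Vector],
    layer_scores: Sequence[float],
) -> Vector:
    """Combine one feature vector per encoder layer into a single vector.

    ASSUMPTION: a softmax-weighted sum stands in for the adaptive
    aggregation described in the abstract; the actual method may differ.
    """
    if len(layer_outputs) != len(layer_scores):
        raise ValueError("need exactly one score per layer output")
    weights = softmax(layer_scores)
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, layer_outputs))
        for i in range(dim)
    ]


# Hypothetical features from a low, middle, and high encoder layer.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give a plain average of the three layers.
uniform = aggregate_layers(layers, [0.0, 0.0, 0.0])

# A much larger score on the last layer makes it dominate the result.
top_heavy = aggregate_layers(layers, [0.0, 0.0, 10.0])
```

With equal scores the sketch reduces to averaging the layers, while uneven scores let one level of representation dominate; choosing those weights adaptively is the part the plugin is described as handling.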

Code & Models

Models

🤗 LooperXX/LLaVA-OV-Manager
