CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Ankan Deria; Komal Kumar; Xilin He; Imran Razzak; Hisham Cholakkal; Fahad Shahbaz Khan; Salman Khan

arXiv:2604.03231 · cs.CV · April 6, 2026

TL;DR

CoME-VL is a modular fusion framework that combines a contrastively trained vision encoder with a self-supervised DINO encoder, improving vision-language model performance on visual understanding and grounding tasks.

Contribution

The paper proposes a multi-encoder fusion method that combines entropy-guided multi-layer aggregation with cross-attention to better integrate diverse visual representations.

Findings

1. Improves visual understanding tasks by 4.9% on average.

2. Improves grounding tasks by 5.4%.

3. Sets a new state of the art on the RefCOCO detection benchmark.

Abstract

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to…
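The abstract names entropy-guided multi-layer aggregation and orthogonality-constrained projections but does not give formulas here, so the following is only a rough sketch of what such components could look like, not the authors' implementation. Every function name, the choice to favour lower-entropy layers, the temperature parameter, and the penalty form ||P Pᵀ − I||² are assumptions made for illustration. The RoPE-enhanced cross-attention step is not covered.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_layer_weights(layer_scores, temperature=1.0):
    """Turn per-layer raw scores into one weight per layer.

    ASSUMPTION: layers whose score distribution has lower entropy get
    more weight. The paper may use a different rule or direction.
    """
    entropies = [shannon_entropy(softmax(s)) for s in layer_scores]
    return softmax([-h / temperature for h in entropies])


def aggregate_layers(layer_features, weights):
    """Weighted sum of per-layer feature vectors (all of equal length)."""
    dim = len(layer_features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, layer_features))
        for i in range(dim)
    ]


def orthogonality_penalty(proj):
    """Squared Frobenius norm of (P P^T - I) for a row-major matrix P.

    ASSUMPTION: a common soft orthogonality regulariser, used here as a
    stand-in for the paper's orthogonality constraint.
    """
    n = len(proj)
    penalty = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(proj[i], proj[j]))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


if __name__ == "__main__":
    # Hypothetical toy inputs: two layers, four scores each.
    scores = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
    weights = entropy_layer_weights(scores)
    fused = aggregate_layers([[1.0, 2.0], [3.0, 4.0]], weights)
    print(weights, fused)
    print(orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]]))
```

In this toy setup the second layer, whose scores are sharply peaked, receives the larger weight, and an identity projection incurs zero penalty. In a real model the penalty would typically be added to the training loss with a small coefficient.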

Code & Models

Repositories

mbzuai-oryx/CoME-VL (GitHub)

Models

MBZUAI/CoME-VL (Hugging Face model, 9 downloads, 4 likes)
