Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

Yikang Zhou; Tao Zhang; Shilin Xu; Shihao Chen; Qianyu Zhou; Yunhai Tong; Shunping Ji; Jiangning Zhang; Lu Qi; Xiangtai Li

arXiv:2501.04670·cs.CV·July 10, 2025

TL;DR

This paper introduces a new benchmark and dataset for evaluating visual correspondence in multimodal large language models (MLLMs). The benchmark reveals shortcomings in current models, and the paper proposes a contrastive model that outperforms existing models on this task.

Contribution

The paper presents the first visual correspondence dataset and benchmark for MLLMs, along with a novel contrastive model, CoLVA, that improves visual matching performance.
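
This listing does not describe CoLVA's training objective, so the following is only a generic illustration of what a contrastive matching objective can look like, not the authors' method. It is a minimal sketch of an InfoNCE-style loss in plain Python, assuming a query object and each candidate object are already represented as non-zero embedding vectors. The function names (`cosine`, `info_nce`, `predict_match`) and the temperature value are hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, candidates, positive_index, temperature=0.07):
    """InfoNCE-style loss: cross-entropy over similarity logits.

    The loss is low when the query is most similar to the candidate at
    positive_index and high when another candidate is more similar.
    The temperature default is an illustrative assumption.
    """
    logits = [cosine(query, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[positive_index]


def predict_match(query, candidates):
    """Return the index of the candidate most similar to the query."""
    sims = [cosine(query, c) for c in candidates]
    return max(range(len(sims)), key=sims.__getitem__)


if __name__ == "__main__":
    # Toy embeddings (made up for illustration): candidate 1 is closest.
    query = [1.0, 0.0]
    candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
    print("predicted match:", predict_match(query, candidates))
    print("loss if candidate 1 is correct:", info_nce(query, candidates, 1))
    print("loss if candidate 0 is correct:", info_nce(query, candidates, 0))
```

In this toy setup, matching accuracy on a benchmark would simply be the fraction of queries for which `predict_match` returns the annotated candidate.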

Findings

01 CoLVA achieves 49.80% accuracy on the MMVM benchmark.

02 The MMVM benchmark reveals systematic shortcomings in current MLLMs.

03 The dataset and benchmark facilitate comprehensive evaluation of visual matching abilities.

Abstract

Recent advancements in multimodal large language models (MLLMs) have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, the visual matching ability of MLLMs is rarely studied, even though finding the visual correspondence of objects is essential in computer vision. Our research reveals that the matching capabilities of recent MLLMs still exhibit systematic shortcomings, even in strong current models such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly evaluate over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects, based on the required cues and capabilities, to evaluate and analyze current MLLMs more comprehensively. In addition, we have designed an…

Code & Models

Repositories

zhouyiks/colva (PyTorch, official)

Models

zhouyik/colva_internvl2_4b (Hugging Face model · 7 downloads · 1 like)

Taxonomy

Topics: Language, Metaphor, and Cognition · linguistics and terminology studies · Translation Studies and Practices

Methods: Shrink and Fine-Tune · Contrastive Learning