Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Lu Qi, Xiangtai Li

TL;DR
This paper introduces a new benchmark and dataset for evaluating visual correspondence in multimodal large language models, revealing current shortcomings and proposing a contrastive model that outperforms existing models on this task.
Contribution
The paper presents the first visual correspondence dataset and benchmark for MLLMs, along with a novel contrastive model, CoLVA, that improves visual matching performance.
Findings
CoLVA achieves 49.80% accuracy on the MMVM benchmark.
The MMVM benchmark reveals systematic shortcomings in current MLLMs.
The dataset and benchmark facilitate comprehensive evaluation of visual matching abilities.
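As a rough illustration of how a headline accuracy figure and a per-aspect breakdown might be computed for a matching benchmark of this kind, here is a minimal sketch. It assumes each sample carries a category label, a predicted answer, and an expected answer; the record layout, the function name `accuracy_by_category`, and the category names are hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute overall and per-category accuracy.

    `records` is an iterable of (category, predicted, expected) tuples.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, predicted, expected in records:
        totals[category] += 1
        if predicted == expected:
            correct[category] += 1

    per_category = {c: correct[c] / totals[c] for c in totals}
    n_samples = sum(totals.values())
    # Guard against an empty record list to avoid dividing by zero.
    overall = sum(correct.values()) / n_samples if n_samples else 0.0
    return overall, per_category


# Hypothetical toy records; the category names are placeholders,
# not the eight aspects defined by the benchmark.
records = [
    ("color", "A", "A"),
    ("color", "B", "C"),
    ("shape", "D", "D"),
    ("shape", "B", "B"),
]
overall, per_category = accuracy_by_category(records)
```

Reporting the per-category values alongside the overall number makes it easier to see which kinds of cues a model handles poorly, which a single aggregate score would hide.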
Abstract
Recent advancements in multimodal large language models (MLLMs) have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, the visual matching ability of MLLMs is rarely studied, even though finding the visual correspondence of objects is essential in computer vision. Our research reveals that the matching capabilities of recent MLLMs still exhibit systematic shortcomings, even in strong current models such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly evaluate over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects based on the required cues and capabilities to more comprehensively evaluate and analyze current MLLMs. In addition, we have designed an…
Taxonomy
Topics: Language, Metaphor, and Cognition · Linguistics and Terminology Studies · Translation Studies and Practices
Methods: Shrink and Fine-Tune · Contrastive Learning
