Towards Cross-View Point Correspondence in Vision-Language Models
Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, Xiaolong Zheng

TL;DR
This paper introduces a new benchmark, dataset, and model for achieving precise cross-view point correspondence in vision-language models, addressing a key challenge in spatial understanding and embodied AI.
Contribution
The paper presents the CrossPoint-Bench benchmark, the CrossPoint-378K dataset, and the CroPond model, advancing the state of the art in fine-grained cross-view correspondence in VLMs.
Findings
State-of-the-art models (e.g., Gemini-2.5-Pro) trail humans by over 54.65% in overall accuracy on CrossPoint-Bench.
CroPond surpasses Gemini-2.5-Pro by 39.7% on CrossPoint-Bench.
The dataset and benchmark facilitate progress in spatial understanding tasks.
Abstract
Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially point-level correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with a hierarchical design inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing the challenge of moving from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
