Towards Cross-View Point Correspondence in Vision-Language Models

Yipu Wang; Yuheng Ji; Yuyang Liu; Enshen Zhou; Ziqiang Yang; Yuxuan Tian; Ziheng Qin; Yue Liu; Huajie Tan; Cheng Chi; Zhiyuan Ma; Daniel Dajun Zeng; Xiaolong Zheng

arXiv:2512.04686·cs.CV·December 9, 2025

TL;DR

This paper introduces a new benchmark, dataset, and model for achieving precise cross-view point correspondence in vision-language models, addressing a key challenge in spatial understanding and embodied AI.

Contribution

The paper presents the CrossPoint-Bench benchmark, the CrossPoint-378K dataset, and the CroPond model, advancing the state of the art in fine-grained cross-view correspondence for VLMs.

Findings

01. State-of-the-art models (e.g., Gemini-2.5-Pro) trail humans by over 54.65% in overall accuracy.

02. CroPond surpasses Gemini-2.5-Pro by 39.7% on the benchmark.

03. The dataset and benchmark facilitate progress in spatial understanding tasks.

Abstract

Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with a hierarchical design inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing the challenge of moving from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications