Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence

Dian Liu; Jie Feng; Di Li; Yuhui Zheng; Guanbin Li; Weisheng Dong; Guangming Shi

arXiv:2604.02020·cs.CV·April 3, 2026

TL;DR

This paper introduces LinkS$^2$Bench, a comprehensive benchmark linking UAV and satellite imagery to evaluate and improve Vision-Language Models' dynamic cross-view spatial reasoning capabilities.

Contribution

It presents the first benchmark for dynamic UAV-satellite spatial intelligence, including a large dataset, annotated tasks, and a novel alignment method to enhance VLM performance.

Findings

01. VLMs perform substantially worse than humans on the benchmark.

02. Explicit cross-view alignment improves VLM accuracy.

03. Fine-tuning on LinkS$^2$Bench enhances spatial reasoning abilities.

Abstract

Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an…
