CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection

Zhipeng Liu; Chunbo Luo

arXiv:2605.09802 · cs.CV · May 12, 2026


TL;DR

CrossVL combines complexity-aware feature routing with paired curriculum learning to improve cross-view vision-language detection, targeting the geometric and scene-complexity differences between ground and aerial viewpoints.

Contribution

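The paper proposes two components: Complexity-Aware Pathway Aggregation (CPA), which estimates scene complexity and routes visual features accordingly, and Paired Curriculum Learning (PCL), a curriculum-based training scheme built on synchronized ground-aerial pairs. Together they aim to improve robustness and accuracy in cross-view detection.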

Findings

01. CrossVL improves aerial mAP from 58.66% to 61.03%.

02. It reduces the ground-aerial performance gap from 8.63 pp to 6.65 pp.

03. It achieves a 3.3x reduction in variance across seeds.
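
To make the size of these changes easier to read, the short sketch below works out the differences implied by the reported figures. It uses only the numbers listed above; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the findings above (percent and percentage points).
aerial_map_before, aerial_map_after = 58.66, 61.03
gap_before, gap_after = 8.63, 6.65

# Absolute gain in aerial mAP, in percentage points.
aerial_gain_pp = round(aerial_map_after - aerial_map_before, 2)

# Absolute and relative shrinkage of the ground-aerial gap.
gap_drop_pp = round(gap_before - gap_after, 2)
gap_drop_rel = round((gap_before - gap_after) / gap_before * 100, 1)

print(aerial_gain_pp)  # 2.37
print(gap_drop_pp)     # 1.98
print(gap_drop_rel)    # 22.9
```

In other words, the aerial gain is 2.37 pp, and the gap narrows by 1.98 pp, which is roughly 23% of its original size.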

Abstract

Vision-language models (VLMs) enable text-guided object detection but degrade severely in cross-view scenarios, where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints: ground-view images contain dense, highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework that combines Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) to improve cross-view detection with VLMs. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and…
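
The CPA idea in the abstract (estimate a complexity signal from multimodal statistics, then route visual features through several pathways) can be illustrated with a small soft-routing sketch. Everything here is an assumption made for illustration: the two statistics (visual spread and visual-text alignment), the two pathways (`local_path`, `global_path`), the linear gate, and all function names are hypothetical and are not taken from the paper, which may define these components quite differently.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def complexity_stats(tokens, text):
    """Hypothetical statistics: variance of visual tokens and mean text alignment."""
    n, dim = len(tokens), len(tokens[0])
    means = [sum(t[d] for t in tokens) / n for d in range(dim)]
    spread = sum((t[d] - means[d]) ** 2 for t in tokens for d in range(dim)) / (n * dim)
    align = sum(cosine(t, text) for t in tokens) / n
    return [spread, align]

def local_path(tokens):
    # Assumed pathway that preserves per-token detail.
    return [list(t) for t in tokens]

def global_path(tokens):
    # Assumed pathway that replaces every token with the global mean.
    n, dim = len(tokens), len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    return [list(mean) for _ in tokens]

def route(tokens, text, gate_w, gate_b, paths):
    """Mix pathway outputs with softmax gates computed from the statistics."""
    stats = complexity_stats(tokens, text)
    logits = [sum(w * s for w, s in zip(row, stats)) + b
              for row, b in zip(gate_w, gate_b)]
    gates = softmax(logits)
    outs = [p(tokens) for p in paths]
    dim = len(tokens[0])
    mixed = [[sum(g * out[i][d] for g, out in zip(gates, outs)) for d in range(dim)]
             for i in range(len(tokens))]
    return mixed, gates

# Toy input: three 2-D visual tokens and one 2-D text embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [1.0, 1.0]
# Hand-picked gate weights: higher spread favours the local pathway.
mixed, gates = route(tokens, text, [[2.0, 0.0], [-2.0, 0.0]], [0.0, 0.0],
                     [local_path, global_path])
print(gates)
```

With these hand-picked weights, a scene with higher feature spread shifts the gate toward the detail-preserving pathway, while a uniform scene leans on the pooled one. In a trained model the gate parameters would be learned rather than fixed, and the pathways would be network modules rather than these toy functions.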
