Can Vision-Language Models Think from the Sky? Unifying UAV Reasoning and Generation
Jintao Sun, Gangyi Ding, Donglin Di, Hu Zhang, Zhedong Zheng

TL;DR
This paper introduces UAVReason, a large-scale UAV-native dataset and evaluation suite, and adapts UAVReason-Bagel as a unified baseline that jointly optimizes aerial reasoning and generation, improving VQA F1 over pretrained models.
Contribution
The paper presents UAVReason, a comprehensive UAV-native dataset and evaluation suite, and develops UAVReason-Bagel, a unified model that jointly optimizes reasoning and generation for aerial imagery.
Findings
UAVReason-Bagel improves VQA F1 scores significantly over pretrained models.
Unified training enhances both reasoning accuracy and generation quality.
Synthesis (generation) and reasoning benefit each other, improving aerial scene understanding.
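The findings above cite VQA F1 without saying how it is computed. One common choice for free-form answers is token-overlap F1, as used in SQuAD-style evaluation. The sketch below assumes that definition; the function name `token_f1` and the whitespace tokenization are illustrative assumptions, not the paper's evaluation protocol.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercase whitespace tokenization. The paper's actual
    normalization and metric definition may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical aerial VQA answer pair, for illustration only.
    print(token_f1("three white cars", "three cars"))
```

A token-level score of this kind gives partial credit for answers that are nearly right, which matters when questions ask for counts or object lists in dense aerial scenes.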
Abstract
Vision-Language Models have achieved strong progress in ground-view visual understanding, yet they remain brittle in high-altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitive, and top-down orientations are ambiguous. We introduce UAVReason, a large-scale UAV-native dataset and evaluation suite for studying unified aerial reasoning and generation under this nadir-view domain shift. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain. It contains 23.6K captioned frames, 273K VQA pairs including 68.2K two-frame temporal questions, and 188.8K cross-modal generation samples across RGB, depth, and segmentation modalities. We further adapt UAVReason-Bagel as a unified understanding-and-generation baseline that jointly optimizes language reasoning and…
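The dataset counts in the abstract can be related to one another with simple arithmetic. The sketch below uses only the four figures stated above (in thousands). The derived ratios are computed here, not reported by the authors, and treating the captioned frames as the base for per-frame averages is an assumption.

```python
# Counts as stated in the abstract, in thousands (rounded by the authors).
captioned_frames_k = 23.6
vqa_pairs_k = 273.0
temporal_vqa_k = 68.2        # two-frame temporal questions, a subset of VQA
generation_samples_k = 188.8

# Share of VQA pairs that are two-frame temporal questions.
temporal_share = temporal_vqa_k / vqa_pairs_k

# Remaining VQA pairs (everything that is not a two-frame temporal question).
non_temporal_vqa_k = vqa_pairs_k - temporal_vqa_k

# Average generation samples per captioned frame (assumes the samples
# are drawn from the same set of frames, which the abstract does not state).
generation_per_frame = generation_samples_k / captioned_frames_k

# Average VQA pairs per captioned frame, under the same assumption.
vqa_per_frame = vqa_pairs_k / captioned_frames_k

if __name__ == "__main__":
    print(f"temporal share of VQA: {temporal_share:.1%}")
    print(f"non-temporal VQA pairs: {non_temporal_vqa_k:.1f}K")
    print(f"generation samples per frame: {generation_per_frame:.2f}")
    print(f"VQA pairs per frame: {vqa_per_frame:.2f}")
```

Under these assumptions, roughly a quarter of the VQA pairs are temporal, and the generation set averages about eight samples per captioned frame.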
