Can Vision-Language Models Think from the Sky? Unifying UAV Reasoning and Generation

Jintao Sun; Gangyi Ding; Donglin Di; Hu Zhang; Zhedong Zheng

arXiv:2604.05377 · cs.CV · May 8, 2026

TL;DR

This paper introduces UAVReason, a large-scale UAV-native dataset and evaluation suite, and adapts UAVReason-Bagel, a unified understanding-and-generation model that improves aerial reasoning and generation over pretrained models.

Contribution

The paper presents UAVReason, a comprehensive UAV-native dataset and evaluation suite, and develops UAVReason-Bagel, a unified model that jointly optimizes reasoning and generation for aerial imagery.

Findings

01. UAVReason-Bagel significantly improves VQA F1 scores over pretrained models.

02. Unified training improves both reasoning accuracy and generation quality.

03. Generation and reasoning benefit each other, improving aerial scene understanding.
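The first finding is stated in terms of VQA F1. This page does not say how the paper computes that score, so the sketch below is only one common convention: a token-overlap F1 between a predicted answer and a reference answer. The function name `token_f1` and the whitespace tokenization are assumptions for illustration, not the authors' evaluation code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings (illustrative only).

    Assumption: answers are compared as lowercase, whitespace-split tokens.
    The paper's actual F1 definition may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, comparing the prediction "three white cars" with the reference "three cars" gives precision 2/3 and recall 1, so the F1 is 0.8.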

Abstract

Vision-Language Models have achieved strong progress in ground-view visual understanding, yet they remain brittle in high-altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitive, and top-down orientations are ambiguous. We introduce UAVReason, a large-scale UAV-native dataset and evaluation suite for studying unified aerial reasoning and generation under this nadir-view domain shift. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain. It contains 23.6K captioned frames, 273K VQA pairs including 68.2K two-frame temporal questions, and 188.8K cross-modal generation samples across RGB, depth, and segmentation modalities. We further adapt UAVReason-Bagel as a unified understanding-and-generation baseline that jointly optimizes language reasoning and…
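To make the dataset description concrete, here is a minimal sketch of what one aligned record could look like, together with a few ratios computed from the counts quoted in the abstract. The class names, field names, and file-path layout are hypothetical; the abstract only states which modalities are aligned, not how they are stored. The counts are rounded to the nearest 0.1K, so the ratios are approximate, and the abstract does not say that every VQA or generation sample derives from a captioned frame.

```python
from dataclasses import dataclass, field


@dataclass
class QAPair:
    question: str
    answer: str
    # Hypothetical convention: one frame id for a single-frame question,
    # two frame ids for a two-frame temporal question.
    frame_ids: tuple


@dataclass
class AerialSample:
    # Hypothetical record aligning the modalities listed in the abstract.
    frame_id: str
    rgb_path: str
    depth_path: str
    seg_path: str
    caption: str
    qa_pairs: list = field(default_factory=list)

    def temporal_questions(self):
        """Return only the QA pairs that span two frames."""
        return [qa for qa in self.qa_pairs if len(qa.frame_ids) == 2]


# Counts quoted in the abstract, in thousands.
CAPTIONED_FRAMES_K = 23.6
VQA_PAIRS_K = 273.0
TEMPORAL_VQA_K = 68.2
GENERATION_SAMPLES_K = 188.8


def dataset_ratios():
    """Descriptive ratios derived from the rounded counts above."""
    return {
        "temporal_share": TEMPORAL_VQA_K / VQA_PAIRS_K,
        "vqa_per_frame": VQA_PAIRS_K / CAPTIONED_FRAMES_K,
        "generation_per_frame": GENERATION_SAMPLES_K / CAPTIONED_FRAMES_K,
    }
```

Under these rounded figures, roughly a quarter of the VQA pairs are two-frame temporal questions, there are about 11.6 VQA pairs per captioned frame, and about 8 generation samples per captioned frame.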
