UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

Jiajin Guan (1), Haibo Mei (2), Bonan Zhang (1), Dan Liu (1), Yuanshuang Fu (1), Yue Zhang (2) ((1) Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China; (2) School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, Chengdu, China)

arXiv:2508.11196 · cs.CV · May 7, 2026

TL;DR

UAV-VL-R1 is a lightweight vision-language model tailored for UAV aerial reasoning, trained with supervised fine-tuning and multi-stage reinforcement learning, achieving high accuracy and efficiency on UAV-specific tasks.

Contribution

The paper introduces UAV-VL-R1, a UAV-specific VLM trained with a hybrid of supervised fine-tuning and multi-stage reinforcement learning, and provides a new high-resolution UAV reasoning dataset.

Findings

01. UAV-VL-R1 outperforms larger models in zero-shot accuracy on UAV tasks.

02. The model is memory-efficient, which makes it suitable for real-time UAV deployment.

03. GRPO reinforcement learning enhances logical reasoning and inference robustness.

Abstract

Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address them, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a…
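
The abstract's mention of rule-guided rewards and intra-group policy alignment can be illustrated with a minimal sketch of the group-relative advantage that GRPO is generally described as using: several responses are sampled for one prompt, each is scored by a rule, and each score is normalized against the group's mean and standard deviation. The reward rule below (a `<think>`/`<answer>` format check plus exact-match accuracy), the function names, and the score weights are assumptions made for illustration only; the excerpt above does not specify the paper's actual reward design.

```python
import re
from statistics import mean, pstdev

# Hypothetical response template; the excerpt does not state the real one.
_TEMPLATE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", flags=re.DOTALL
)


def rule_reward(response: str, gold: str) -> float:
    """Illustrative rule-based reward: 0.5 for format, plus 1.0 for a correct answer."""
    match = _TEMPLATE.fullmatch(response)
    if match is None:
        return 0.0  # malformed output earns nothing
    answer = match.group(1).strip().lower()
    accuracy = 1.0 if answer == gold.strip().lower() else 0.0
    return 0.5 + accuracy


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (mean and std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, three sampled responses (toy data, not from the paper).
group = [
    "<think>wheels, windshield</think><answer>car</answer>",
    "<think>long body</think><answer>bus</answer>",
    "car",
]
rewards = [rule_reward(r, gold="car") for r in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below it are penalized, so no separate learned value model is needed in this sketch. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.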
