UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding

Yang Zhan; Yuan Yuan

arXiv:2603.14336·cs.CV·March 17, 2026

Open Access

TL;DR

This paper introduces UAVBench, a benchmark, and UAVIT-1M, an instruction-tuning dataset, to evaluate and improve multimodal large language models (MLLMs) on low-altitude UAV vision-language understanding, a setting that existing datasets cover only through a few specific tasks.

Contribution

The paper presents UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction-tuning dataset, both designed for low-altitude UAV vision-language tasks, and demonstrates that fine-tuning open-source MLLMs on UAVIT-1M improves their performance.

Findings

01 Open-source MLLMs perform poorly on low-altitude UAV tasks.

02 Fine-tuning on UAVIT-1M significantly improves MLLMs' performance.

03 Closed-source MLLMs outperform open-source counterparts in this domain.

Abstract

Multimodal Large Language Models (MLLMs) have made significant strides in natural images and satellite remote sensing images. However, understanding low-altitude drone scenarios remains a challenge. Existing datasets primarily focus on a few specific low-altitude visual tasks, which cannot fully assess the ability of MLLMs in real-world low-altitude UAV applications. Therefore, we introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction tuning dataset, designed to evaluate and improve MLLMs' abilities in low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image-level and region-level. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene images and about 2,000 types of spatial resolutions with 11 distinct tasks. UAVBench and UAVIT-1M feature…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques