UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding
Yang Zhan, Yuan Yuan

TL;DR
This paper introduces UAVBench, a benchmark, and UAVIT-1M, an instruction-tuning dataset, for evaluating and improving multimodal large language models (MLLMs) on low-altitude UAV vision-language understanding, where existing datasets cover only a few specific visual tasks.
Contribution
The paper presents UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction-tuning dataset, both designed specifically for low-altitude UAV vision-language tasks, and demonstrates that fine-tuning open-source MLLMs on UAVIT-1M improves their performance.
Findings
Open-source MLLMs perform poorly on low-altitude UAV tasks.
Fine-tuning on UAVIT-1M significantly improves MLLMs' performance.
Closed-source MLLMs outperform open-source counterparts in this domain.
Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in natural images and satellite remote sensing images. However, understanding low-altitude drone scenarios remains a challenge. Existing datasets primarily focus on a few specific low-altitude visual tasks, which cannot fully assess the ability of MLLMs in real-world low-altitude UAV applications. Therefore, we introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction tuning dataset, designed to evaluate and improve MLLMs' abilities in low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image-level and region-level. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene images and about 2,000 types of spatial resolutions with 11 distinct tasks. UAVBench and UAVIT-1M feature…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
