PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Shuyan Ke, Yifan Mei, Changli Wu, Yonghan Zheng, Jiayi Ji, Liujuan Cao, and Rongrong Ji

TL;DR
This paper introduces the UAV reasoning segmentation task, a large-scale benchmark dataset (DRSeg), and a pixel-level multimodal language model (PixDLM) that serves as a baseline for the task.
Contribution
It formally defines UAV reasoning segmentation, creates the DRSeg benchmark dataset, and proposes PixDLM as a baseline model for this task.
Findings
PixDLM achieves strong baseline results on DRSeg.
UAV imagery poses challenges that ground-level scenes do not: oblique viewpoints, ultra-high resolutions, and extreme scale variations.
DRSeg contains 10,000 high-resolution aerial images with Chain-of-Thought QA annotations.
Abstract
Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV…
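To make the benchmark's structure concrete, here is a minimal sketch of how one record might be represented: an image, a question, Chain-of-Thought steps, a target mask, and one of the three reasoning types named in the abstract. Every name below (ReasoningType, ReasoningSample, group_by_type, the field names, the example path) is a hypothetical illustration and is not taken from the DRSeg release or the PixDLM code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ReasoningType(Enum):
    """The three reasoning dimensions named in the abstract."""
    SPATIAL = "spatial"
    ATTRIBUTE = "attribute"
    SCENE = "scene"


@dataclass
class ReasoningSample:
    """Hypothetical record layout; the real DRSeg schema may differ."""
    image_path: str
    question: str
    chain_of_thought: List[str]   # intermediate reasoning steps
    reasoning_type: ReasoningType
    mask: List[List[int]]         # binary target mask, rows of 0/1


def group_by_type(
    samples: List[ReasoningSample],
) -> Dict[ReasoningType, List[ReasoningSample]]:
    """Bucket samples by reasoning type, e.g. for per-type evaluation."""
    groups: Dict[ReasoningType, List[ReasoningSample]] = {
        t: [] for t in ReasoningType
    }
    for sample in samples:
        groups[sample.reasoning_type].append(sample)
    return groups


def mask_area(sample: ReasoningSample) -> int:
    """Count foreground pixels in the target mask."""
    return sum(sum(row) for row in sample.mask)


# Illustrative, made-up sample (not real DRSeg data).
example = ReasoningSample(
    image_path="images/example.jpg",
    question="Which vehicle is parked closest to the building entrance?",
    chain_of_thought=[
        "Locate the building entrance.",
        "Find parked vehicles nearby.",
        "Pick the one with the smallest distance to the entrance.",
    ],
    reasoning_type=ReasoningType.SPATIAL,
    mask=[[0, 1, 1], [0, 1, 0], [0, 0, 0]],
)
```

Grouping by reasoning type in this way would allow results to be reported separately for Spatial, Attribute, and Scene-level questions, assuming the benchmark exposes that label per sample.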
