PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

Shuyan Ke, Yifan Mei, Changli Wu, Yonghan Zheng, Jiayi Ji, Liujuan Cao, and Rongrong Ji

arXiv:2604.15670 · cs.CV · April 20, 2026

TL;DR

This paper introduces the UAV reasoning segmentation task, the large-scale DRSeg benchmark, and PixDLM, a pixel-level multimodal language model, to address the challenges specific to UAV imagery: oblique viewpoints, ultra-high resolutions, and extreme scale variations.

Contribution

It formally defines UAV reasoning segmentation, creates the DRSeg benchmark dataset, and proposes PixDLM as a baseline model for this task.

Findings

01. PixDLM achieves strong baseline results on DRSeg.

02. UAV reasoning segmentation presents unique challenges compared to ground-level scenes.

03. DRSeg contains 10,000 high-resolution aerial images with Chain-of-Thought QA annotations.

Abstract

Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV…
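
The abstract describes each DRSeg item as an aerial image paired with Chain-of-Thought QA supervision under one of three reasoning types. The following is a minimal sketch of how such a record might be represented, not the dataset's actual schema: the class names, field names, file path, and helper functions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ReasoningType(Enum):
    # The three reasoning dimensions named in the abstract.
    SPATIAL = "spatial"
    ATTRIBUTE = "attribute"
    SCENE = "scene"


@dataclass
class ReasoningSample:
    # Hypothetical record layout; the real DRSeg format may differ.
    image_path: str
    question: str
    chain_of_thought: List[str]  # intermediate reasoning steps
    answer: str
    reasoning_type: ReasoningType
    mask: List[List[int]]  # binary target mask, rows of 0/1


def mask_area(sample: ReasoningSample) -> int:
    """Count the foreground pixels in the sample's binary mask."""
    return sum(sum(row) for row in sample.mask)


def count_by_type(samples: List[ReasoningSample]) -> Dict[ReasoningType, int]:
    """Count how many samples fall under each reasoning type."""
    counts = {t: 0 for t in ReasoningType}
    for s in samples:
        counts[s.reasoning_type] += 1
    return counts


# Illustrative toy sample (invented values, not taken from DRSeg).
example = ReasoningSample(
    image_path="example.jpg",
    question="Which vehicle is closest to the intersection?",
    chain_of_thought=["Locate the intersection.", "Compare vehicle distances."],
    answer="The vehicle at the lower left.",
    reasoning_type=ReasoningType.SPATIAL,
    mask=[[0, 1, 1], [0, 1, 0]],
)
```

Keeping the reasoning type as an explicit field would make it straightforward to report results per dimension (Spatial, Attribute, Scene-level), which is how the abstract organizes the task.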
