AeroReformer: Aerial Referring Transformer for UAV-based Referring Image Segmentation

Rui Li; Xiaowei Zhao

arXiv:2502.16680·cs.CV·September 3, 2025

TL;DR

AeroReformer is a UAV-specific referring image segmentation framework that combines a vision-language cross-attention module with rotation-aware multi-scale fusion. It is supported by newly created UAV datasets and an automatic annotation pipeline.
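
To make the idea of vision-language cross-attention concrete, here is a minimal sketch in plain Python. It is a generic scaled dot-product cross-attention in which flattened image features act as queries and text token features act as keys and values. It is not the paper's module: the function names, the feature sizes, and the random projection matrices `w_q`, `w_k`, `w_v` are all assumptions chosen for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual, text, w_q, w_k, w_v):
    """Generic cross-attention (illustrative, not the paper's exact design).

    visual: (N, Dv) flattened image features, used as queries.
    text:   (T, Dt) text token features, used as keys and values.
    Returns (N, D) language-aware visual features and (N, T) attention weights.
    """
    q = matmul(visual, w_q)               # (N, D)
    k = matmul(text, w_k)                 # (T, D)
    v = matmul(text, w_v)                 # (T, D)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # (D, T)
    scores = matmul(q, k_t)               # (N, T)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def rand_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
H, W, Dv, T, Dt, D = 4, 4, 8, 5, 6, 8     # assumed toy sizes
visual = rand_matrix(H * W, Dv, rng)      # 16 image positions
text = rand_matrix(T, Dt, rng)            # 5 text tokens
w_q = rand_matrix(Dv, D, rng)
w_k = rand_matrix(Dt, D, rng)
w_v = rand_matrix(Dt, D, rng)

fused, weights = cross_attention(visual, text, w_q, w_k, w_v)
print(len(fused), len(fused[0]), len(weights[0]))
```

Each row of `weights` shows how strongly one image position attends to each word of the referring expression, which is the basic mechanism that lets the text steer the segmentation features.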

Contribution

The paper presents AeroReformer, the first UAV-specific RIS model with a new dataset and an automatic annotation pipeline leveraging multimodal large language models.

Findings

01 AeroReformer outperforms existing methods on UAV-RIS datasets.

02 The automatic labeling pipeline effectively generates training data.

03 The proposed model establishes a new benchmark for UAV-based referring segmentation.

Abstract

As a novel and challenging task, referring segmentation combines computer vision and natural language processing to localize and segment objects based on textual descriptions. While referring image segmentation (RIS) has been extensively studied in natural images, little attention has been given to aerial imagery, particularly from unmanned aerial vehicles (UAVs). The unique challenges of UAV imagery, including complex spatial scales, occlusions, and varying object orientations, render existing RIS approaches ineffective. A key limitation has been the lack of UAV-specific datasets, as manually annotating pixel-level masks and generating textual descriptions is labour-intensive and time-consuming. To address this gap, we design an automatic labelling pipeline that leverages pre-existing UAV segmentation datasets and Multimodal Large Language Models (MLLM) for generating textual…

Code & Models

Repositories

lironui/aeroreformer (official)

Models

lironui/AeroReformer (Hugging Face model)

Taxonomy

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer