TL;DR
AeroReformer is a novel UAV-specific referring image segmentation (RIS) framework that combines a vision-language cross-attention module with rotation-aware multi-scale fusion. It is supported by newly created UAV datasets and an automatic annotation pipeline.
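As a rough illustration of what a vision-language cross-attention step generally looks like, the sketch below lets each visual feature vector attend over a set of text token vectors and adds the attended text context back as a residual. This is not AeroReformer's module: the function name `cross_attention`, the toy 2-D features, and the omission of learned query/key/value projections and multiple heads are all simplifying assumptions made here for readability.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(visual, text):
    """Hypothetical helper: visual features act as queries, text features
    act as both keys and values. No learned projections (an assumption)."""
    d = len(text[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in visual:
        # Scaled dot-product score between this visual vector and each text token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text]
        weights = softmax(scores)
        # Weighted sum of text vectors: the language context for this location.
        attended = [sum(w * v[j] for w, v in zip(weights, text)) for j in range(d)]
        # Residual connection keeps the original visual information.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused

# Toy inputs: three visual locations and two text tokens, each 2-dimensional.
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(visual, text)
```

In this toy setup, a visual vector aligned with the first text token places more attention weight on that token, so its fused feature is pulled toward it. A real model would apply learned linear projections before the dot product and typically use several attention heads.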
Contribution
The paper presents AeroReformer, the first UAV-specific RIS model with a new dataset and an automatic annotation pipeline leveraging multimodal large language models.
Findings
AeroReformer outperforms existing methods on UAV-RIS datasets.
The automatic labeling pipeline effectively generates training data.
The proposed model establishes a new benchmark for UAV-based referring segmentation.
Abstract
As a novel and challenging task, referring segmentation combines computer vision and natural language processing to localize and segment objects based on textual descriptions. While referring image segmentation (RIS) has been extensively studied in natural images, little attention has been given to aerial imagery, particularly from unmanned aerial vehicles (UAVs). The unique challenges of UAV imagery, including complex spatial scales, occlusions, and varying object orientations, render existing RIS approaches ineffective. A key limitation has been the lack of UAV-specific datasets, as manually annotating pixel-level masks and generating textual descriptions is labour-intensive and time-consuming. To address this gap, we design an automatic labelling pipeline that leverages pre-existing UAV segmentation datasets and Multimodal Large Language Models (MLLM) for generating textual…
Taxonomy
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer
