SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale

Markus Gross; Sai Bharadhwaj Matha; Rui Song; Viswanathan Muthuveerappan; Conrad Christoph; Julius Huber; Daniel Cremers

arXiv:2603.17920 · cs.CV · March 19, 2026

TL;DR

SegFly introduces a scalable 2D-3D-2D framework leveraging multi-view redundancy and geometry to automatically generate dense, high-quality RGB and thermal annotations for aerial imagery, enabling large-scale multi-modal semantic segmentation.

Contribution

The paper presents a novel geometry-driven 2D-3D-2D paradigm that automates label propagation and RGB-T alignment, significantly reducing manual effort and expanding the scale of aerial semantic segmentation datasets.

Findings

01. Automatically generates 97% of RGB labels and 100% of thermal labels with high accuracy.

02. Achieves 87% registration accuracy for RGB-T alignment without hardware synchronization.

03. Constructs a large-scale, diverse aerial dataset with over 20,000 images and 15,000 RGB-T pairs.

Abstract

Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving…
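The reprojection half of the 2D-3D-2D idea (from a semantic 3D point cloud back into an individual view) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an ideal pinhole camera without lens distortion, a known world-to-camera pose, and nearest-point-wins occlusion handling through a simple depth buffer. The names `render_labels`, `matvec`, and `IGNORE` are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

IGNORE = 255  # hypothetical id for pixels that no 3D point projects onto


def matvec(R: Sequence[Sequence[float]], p: Point) -> Point:
    """Multiply a 3x3 rotation matrix by a 3D point."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))


def render_labels(
    points: Sequence[Point],
    labels: Sequence[int],
    R: Sequence[Sequence[float]],   # assumed world-to-camera rotation
    t: Point,                       # assumed world-to-camera translation
    fx: float, fy: float,           # focal lengths in pixels
    cx: float, cy: float,           # principal point in pixels
    width: int, height: int,
) -> List[List[int]]:
    """Project labeled 3D points into one view, keeping the nearest point per pixel."""
    label_map = [[IGNORE] * width for _ in range(height)]
    depth_map = [[float("inf")] * width for _ in range(height)]

    for p, lab in zip(points, labels):
        # Transform the point from world coordinates into the camera frame.
        x, y, z = (c + tc for c, tc in zip(matvec(R, p), t))
        if z <= 0:
            continue  # point lies behind the camera

        # Pinhole projection to integer pixel coordinates.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))

        # Depth test: only the closest point may write its label.
        if 0 <= u < width and 0 <= v < height and z < depth_map[v][u]:
            depth_map[v][u] = z
            label_map[v][u] = lab

    return label_map


# Tiny example: identity pose, 3x3 image, two points competing for one pixel.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cloud = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (1.0, 0.0, 1.0)]
classes = [1, 2, 3, 4]
demo = render_labels(cloud, classes, identity, (0.0, 0.0, 0.0),
                     1.0, 1.0, 1.0, 1.0, 3, 3)
```

In principle the same routine could be called once per camera, with that camera's own intrinsics and pose, which is one way to picture how a single labeled point cloud can serve many views. A sparse cloud leaves unlabeled gaps (`IGNORE` here), so a real pipeline would need some form of densification, which this sketch does not attempt.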

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques