General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation
Nhi Kieu, Kien Nguyen, Sridha Sridharan, Clinton Fookes

TL;DR
This paper evaluates PerceiverIO, a general-purpose multimodal transformer, on remote sensing semantic segmentation, and addresses its weaknesses in handling object scale variation and detecting small objects such as cars with a novel 3D convolutional module.
Contribution
It introduces a UNet-inspired 3D convolutional module to enhance PerceiverIO's performance in remote sensing segmentation tasks, reducing the need for specialized architectures.
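For readers unfamiliar with the underlying operation, the following is a minimal sketch of a single 3D convolution in plain Python. It is not the authors' module: the summary does not describe the UNet-inspired design, so the function `conv3d_valid`, the toy input, and the choice to stack co-registered channels along the depth axis are all illustrative assumptions. It only shows how one 3D kernel mixes information across neighbouring channels and neighbouring pixels at the same time.

```python
def conv3d_valid(volume, kernel):
    """Naive 3D cross-correlation (stride 1, no padding) on nested lists.

    volume: [D][H][W] floats, kernel: [kd][kh][kw] floats.
    Hypothetical helper for illustration only.
    """
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for d in range(D - kd + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                # Weighted sum over a kd x kh x kw neighbourhood.
                acc = 0.0
                for a in range(kd):
                    for b in range(kh):
                        for c in range(kw):
                            acc += volume[d + a][i + b][j + c] * kernel[a][b][c]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Assumed toy input: 3 co-registered 4x4 channels (for example two
# spectral bands and a DSM), each filled with a constant value 1, 2, 3.
volume = [[[float(ch + 1)] * 4 for _ in range(4)] for ch in range(3)]

# Averaging kernel spanning 2 adjacent channels and a 3x3 spatial window.
kernel = [[[1.0 / 18.0] * 3 for _ in range(3)] for _ in range(2)]

features = conv3d_valid(volume, kernel)
shape = (len(features), len(features[0]), len(features[0][0]))
```

With this toy input the output has shape (2, 2, 2): one depth slice per pair of adjacent channels, and a 2x2 spatial map per slice. A UNet-style module would stack several such convolutions with downsampling and upsampling stages, which this sketch does not attempt to reproduce.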
Findings
PerceiverIO struggles with object scale variation and car detection in remote sensing images.
The proposed 3D convolutional module improves detection and segmentation performance.
The method achieves competitive results with specialized architectures like UNetFormer and SwinUNet.
Abstract
The advent of high-resolution multispectral/hyperspectral sensors, LiDAR DSM (Digital Surface Model) information and many others has provided us with an unprecedented wealth of data for Earth Observation. Multimodal AI seeks to exploit those complementary data sources, particularly for complex tasks like semantic segmentation. While specialized architectures have been developed, they are highly complex, demand significant effort in model design, and require considerable re-engineering whenever a new modality emerges. Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance across multiple multimodal tasks with one unified architecture. In this work, we investigate the performance of PerceiverIO, a member of the general-purpose multimodal family, in the remote sensing semantic segmentation domain. Our experiments reveal that this…
Taxonomy
Topics: Remote-Sensing Image Classification · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: Convolution · 3D Convolution
