Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images
Bissmella Bahaduri, Zuheng Ming, Fangchen Feng, Anissa Mokraoui

TL;DR
This paper introduces a novel early-stage multi-modal fusion method using cross-channel attention and an enhanced SWIN transformer, significantly improving small object detection in remote sensing images.
Contribution
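The paper proposes an early-stage multi-modal fusion strategy based on cross-channel attention, paired with a SWIN transformer augmented with convolution layers. The aim is to avoid the computational cost of running parallel per-modality subnetworks and to improve the detection of small objects.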
Findings
Achieves superior detection performance compared to existing methods.
Effectively fuses multiple modalities at early stages for better accuracy.
Enhances SWIN transformer with convolution layers to improve local attention.
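The early-stage fusion idea in the findings above can be illustrated with a minimal NumPy sketch. This is not the authors' architecture. It assumes each image channel (for example three RGB channels plus one IR channel) is flattened into one token, and that every channel attends over all channels from both modalities. The function name `cross_channel_attention`, the projection size `d_k`, and the random projection matrices (standing in for learned weights) are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_channel_attention(rgb, ir, d_k=16, seed=0):
    """Fuse two modalities by attending across their channels.

    rgb: array of shape (C_rgb, H, W)
    ir:  array of shape (C_ir, H, W)
    Returns (fused, attn) with shapes (C, H, W) and (C, C),
    where C = C_rgb + C_ir.
    """
    rng = np.random.default_rng(seed)

    # Stack the modalities along the channel axis before any backbone runs.
    x = np.concatenate([rgb, ir], axis=0)
    c, h, w = x.shape

    # Each channel becomes one token of length H * W.
    tokens = x.reshape(c, h * w)

    # Assumption: random projections stand in for learned query/key weights.
    w_q = rng.standard_normal((h * w, d_k)) / np.sqrt(h * w)
    w_k = rng.standard_normal((h * w, d_k)) / np.sqrt(h * w)
    q = tokens @ w_q  # (C, d_k)
    k = tokens @ w_k  # (C, d_k)

    # Channel-to-channel attention weights; each row sums to 1.
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # (C, C)

    # Every output channel is a weighted mix of all input channels.
    fused = (attn @ tokens).reshape(c, h, w)
    return fused, attn


if __name__ == "__main__":
    gen = np.random.default_rng(42)
    rgb = gen.random((3, 8, 8))
    ir = gen.random((1, 8, 8))
    fused, attn = cross_channel_attention(rgb, ir)
    print(fused.shape, attn.shape)
```

Because the attention matrix is only C by C, the cost of this fusion step depends on the number of channels and not on the number of spatial positions, and a single backbone can then process the fused tensor. In a trained model the projections would be learned and the fused output would feed the detection backbone.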
Abstract
Object detection in Remote Sensing Images (RSI) is a critical task for numerous applications in Earth Observation (EO). Differing from object detection in natural images, object detection in remote sensing images faces challenges of scarcity of annotated data and the presence of small objects represented by only a few pixels. Multi-modal fusion has been determined to enhance the accuracy by fusing data from multiple modalities such as RGB, infrared (IR), lidar, and synthetic aperture radar (SAR). To this end, the fusion of representations at the mid or late stage, produced by parallel subnetworks, is dominant, with the disadvantages of increasing computational complexity in the order of the number of modalities and the creation of additional engineering obstacles. Using the cross-attention mechanism, we propose a novel multi-modal fusion strategy for mapping relationships between…
Taxonomy
Topics: Remote-Sensing Image Classification · Advanced Image and Video Retrieval Techniques · Infrared Target Detection Methodologies
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Stochastic Depth · Softmax · Dense Connections · Residual Connection · Swin Transformer · Convolution
