Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery
Nicolas Drapier, Aladine Chetouani, Aurélien Chateigner

TL;DR
This paper introduces GLOD, a transformer-based architecture optimized for object detection in high-resolution satellite images, achieving state-of-the-art performance through novel fusion and upsampling techniques.
Contribution
The paper proposes a transformer-first model with innovative fusion and upsampling modules, surpassing existing methods in satellite imagery object detection.
Findings
Achieves 32.95% mAP on the xView dataset
Outperforms prior state-of-the-art (SOTA) methods on xView by a reported 11.46%
Introduces asymmetric fusion with CBAM attention
Abstract
We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% mAP on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design that captures objects across scales. The architecture is optimized for the challenges of satellite imagery, leveraging spatial priors while maintaining computational efficiency.
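To make "asymmetric fusion with CBAM attention" more concrete, here is a minimal NumPy sketch of a CBAM-style gate (channel attention followed by spatial attention) applied to only one of two feature maps before they are summed. This is an illustration under stated assumptions, not the GLOD implementation: the abstract does not say which branch is gated or how the maps are merged, so the function `asymmetric_fuse`, the nearest-neighbour 2x upsampling, the reduction ratio, the kernel size, and the random weights are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """CBAM channel gate. x: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    avg = x.mean(axis=(1, 2))            # average-pooled descriptor, (C,)
    mx = x.max(axis=(1, 2))              # max-pooled descriptor, (C,)
    # Shared two-layer MLP with ReLU, applied to both descriptors.
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    return sigmoid(mlp(avg) + mlp(mx))   # per-channel weights in [0, 1]

def spatial_attention(x, kernel):
    """CBAM spatial gate. x: (C, H, W); kernel: (2, k, k) conv weights."""
    # Pool across channels, then convolve the stacked [avg, max] maps.
    stacked = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    H, W = x.shape[1:]
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)                  # per-pixel weights in [0, 1]

def cbam(x, w1, w2, kernel):
    """Apply channel attention, then spatial attention, to x."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]

def asymmetric_fuse(high_res, low_res, w1, w2, kernel):
    """Hypothetical fusion: gate only the upsampled coarse branch."""
    # Assumption: nearest-neighbour 2x upsampling stands in for UpConvMixer.
    up = low_res.repeat(2, axis=1).repeat(2, axis=2)
    return high_res + cbam(up, w1, w2, kernel)

# Toy inputs; the sizes and random weights are illustrative only.
rng = np.random.default_rng(0)
C, r, k = 8, 4, 7
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
kernel = rng.normal(size=(2, k, k)) * 0.1
fine = rng.normal(size=(C, 16, 16))
coarse = rng.normal(size=(C, 8, 8))
fused = asymmetric_fuse(fine, coarse, w1, w2, kernel)
print(fused.shape)  # (8, 16, 16)
```

The fusion is "asymmetric" in this sketch only in the sense that the two inputs are treated differently: the coarse map is reweighted by attention while the fine map passes through unchanged. In a trained model the weights would be learned, and the explicit convolution loop would be replaced by a framework convolution layer.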
Taxonomy
Topics: Infrared Target Detection Methodologies · Remote-Sensing Image Classification · Advanced Image and Video Retrieval Techniques
