Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery

Nicolas Drapier; Aladine Chetouani; Aurélien Chateigner

arXiv:2507.11040·cs.CV·July 16, 2025

TL;DR

This paper introduces GLOD, a transformer-based architecture optimized for object detection in high-resolution satellite images, achieving state-of-the-art performance through novel fusion and upsampling techniques.

Contribution

The paper proposes a transformer-first model that pairs a Swin Transformer backbone with UpConvMixer upsampling blocks and Fusion Blocks for multi-scale feature integration, surpassing existing methods in object detection on satellite imagery.

Findings

01  Achieves 32.95% mAP on the xView dataset

02  Outperforms SOTA methods by 11.46%

03  Introduces asymmetric fusion with CBAM attention

Abstract

We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.
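
The abstract names CBAM attention but does not say how it is wired into the Fusion Blocks. As a rough illustration only, the sketch below shows the two steps of a generic CBAM (Convolutional Block Attention Module): channel attention followed by spatial attention. It is written in plain Python on nested lists so it runs without dependencies. All function names, weight shapes, the reduction ratio, and the 3x3 kernel in the demo are assumptions made for illustration (biases are omitted); this is not the GLOD implementation.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x, w1, w2):
    """x: C x H x W nested lists. w1: (C//r) x C and w2: C x (C//r) are the
    weights of a shared two-layer MLP. Returns one weight in (0, 1) per channel."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]  # global average pool
    mx = [max(map(max, ch)) for ch in x]                            # global max pool

    def mlp(v):
        hidden = [max(0.0, sum(w * a for w, a in zip(row, v))) for row in w1]  # ReLU
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(x, kernel):
    """kernel: 2 x k x k weights applied to the [channel-mean, channel-max] maps
    with zero padding. Returns an H x W map of weights in (0, 1)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(kernel[0])
    pad = k // 2
    mean_map = [[sum(x[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    max_map = [[max(x[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    maps = [mean_map, max_map]
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < H and 0 <= jj < W:
                            s += kernel[m][di][dj] * maps[m][ii][jj]
            out[i][j] = sigmoid(s)
    return out


def cbam(x, w1, w2, kernel):
    """Rescale channels first, then rescale spatial positions."""
    ca = channel_attention(x, w1, w2)
    x = [[[v * ca[c] for v in row] for row in ch] for c, ch in enumerate(x)]
    sa = spatial_attention(x, kernel)
    H, W = len(sa), len(sa[0])
    return [[[ch[i][j] * sa[i][j] for j in range(W)] for i in range(H)] for ch in x]


if __name__ == "__main__":
    random.seed(0)
    C, H, W, r = 4, 6, 6, 2  # hypothetical sizes; r is the channel reduction ratio

    def rand():
        return random.uniform(-0.5, 0.5)

    feat = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
    w1 = [[rand() for _ in range(C)] for _ in range(C // r)]
    w2 = [[rand() for _ in range(C // r)] for _ in range(C)]
    kernel = [[[rand() for _ in range(3)] for _ in range(3)] for _ in range(2)]
    out = cbam(feat, w1, w2, kernel)
    print(len(out), len(out[0]), len(out[0][0]))  # output keeps the input's C, H, W
```

Because both attention maps pass through a sigmoid, each output value is the corresponding input value scaled by a factor between 0 and 1. The module reweights features without changing the tensor shape, which is why it can be dropped into a fusion path without altering the surrounding layers.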

Taxonomy

Topics: Infrared Target Detection Methodologies · Remote-Sensing Image Classification · Advanced Image and Video Retrieval Techniques