Cross Spatial Temporal Fusion Attention for Remote Sensing Object Detection via Image Feature Matching

Abu Sadat Mohammad Salehin Amit; Xiaoli Zhang; Md Masum Billa Shagar; Zhaojun Liu; Xiongfei Li; Fanlong Meng

arXiv:2507.19118 · cs.CV · October 3, 2025

TL;DR

This paper introduces a novel Cross Spatial Temporal Fusion (CSTF) mechanism that improves cross-modal remote sensing image matching by integrating scale-invariant keypoints and reformulating similarity as a classification task, leading to state-of-the-art object detection results.

Contribution

The paper presents a new CSTF method that enhances feature matching across remote sensing modalities by combining keypoint-based correspondence maps with a classification-based similarity measure.

Findings

01. Achieves state-of-the-art mAP of 90.99% on HRSC2016

02. Achieves state-of-the-art mAP of 90.86% on DOTA

03. Maintains real-time inference speed of 12.5 FPS

Abstract

Effectively describing features for cross-modal remote sensing image matching remains a challenging task due to the significant geometric and radiometric differences between multimodal images. Existing methods primarily extract features at the fully connected layer but often fail to capture cross-modal similarities effectively. We propose a Cross Spatial Temporal Fusion (CSTF) mechanism that enhances feature representation by integrating scale-invariant keypoints detected independently in both reference and query images. Our approach improves feature matching in two ways: first, by creating correspondence maps that leverage information from multiple image regions simultaneously, and second, by reformulating the similarity matching process as a classification task using SoftMax and Fully Convolutional Network (FCN) layers. This dual approach enables CSTF to maintain sensitivity to…
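
The idea of treating similarity matching as a classification task can be illustrated with a small sketch. This is not the authors' CSTF implementation: it assumes each keypoint is already described by a short feature vector, treats every reference keypoint as one "class", and applies a softmax over dot-product similarity scores. The function names, the `temperature` parameter, and the toy descriptors are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_as_classification(query_desc, ref_descs, temperature=0.1):
    """Pick the reference keypoint that best matches one query keypoint.

    Each reference descriptor acts as a class. The score for a class is
    the dot product with the query descriptor, scaled by a hypothetical
    temperature; softmax turns the scores into class probabilities.
    """
    scores = [
        sum(q * r for q, r in zip(query_desc, ref)) / temperature
        for ref in ref_descs
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy descriptors (assumed values, not taken from the paper).
ref_descs = [[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]]
query = [0.1, 0.9]

best, probs = match_as_classification(query, ref_descs)
print("best reference index:", best)
print("class probabilities:", [round(p, 4) for p in probs])
```

In this toy setup the query is closest to the second reference descriptor, so that index receives the highest probability. A learned model would replace the raw dot product with scores produced by trained layers, but the softmax-over-candidates step stays the same.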
