Cross Spatial Temporal Fusion Attention for Remote Sensing Object Detection via Image Feature Matching
Abu Sadat Mohammad Salehin Amit, Xiaoli Zhang, Md Masum Billa Shagar, Zhaojun Liu, Xiongfei Li, Fanlong Meng

TL;DR
This paper introduces a novel Cross Spatial Temporal Fusion (CSTF) mechanism that improves cross-modal remote sensing image matching by integrating scale-invariant keypoints and reformulating similarity as a classification task, leading to state-of-the-art object detection results.
Contribution
The paper presents a new CSTF method that enhances feature matching across remote sensing modalities by combining keypoint-based correspondence maps with a classification-based similarity measure.
Findings
Achieves state-of-the-art mAP of 90.99% on HRSC2016
Achieves state-of-the-art mAP of 90.86% on DOTA
Maintains real-time inference speed of 12.5 FPS
Abstract
Effectively describing features for cross-modal remote sensing image matching remains a challenging task due to the significant geometric and radiometric differences between multimodal images. Existing methods primarily extract features at the fully connected layer but often fail to capture cross-modal similarities effectively. We propose a Cross Spatial Temporal Fusion (CSTF) mechanism that enhances feature representation by integrating scale-invariant keypoints detected independently in both reference and query images. Our approach improves feature matching in two ways: first, by creating correspondence maps that leverage information from multiple image regions simultaneously, and second, by reformulating the similarity matching process as a classification task using SoftMax and Fully Convolutional Network (FCN) layers. This dual approach enables CSTF to maintain sensitivity to…
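The idea of treating similarity matching as a classification task can be illustrated with a minimal sketch. This is not the authors' CSTF implementation: it assumes each reference keypoint descriptor acts as one "class", uses a plain dot product as the similarity score, and omits the FCN layers entirely. The function names, the `temperature` parameter, and the toy descriptor values are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def match_as_classification(query_desc, ref_descs, temperature=1.0):
    """Score a query descriptor against every reference descriptor and
    turn the scores into class probabilities (one class per reference
    keypoint). Returns (index of most probable match, probabilities).

    Assumption: dot-product similarity stands in for whatever learned
    similarity the paper's network produces.
    """
    logits = [dot(query_desc, r) / temperature for r in ref_descs]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy descriptors (hypothetical values, not taken from the paper).
query = [1.0, 0.0, 0.5]
refs = [
    [0.0, 1.0, 0.0],   # unrelated keypoint
    [0.9, 0.1, 0.4],   # close to the query
    [-1.0, 0.0, 0.0],  # opposite direction
]
best, probs = match_as_classification(query, refs)
print(best, [round(p, 3) for p in probs])
```

Framing the match as a probability distribution over candidates, rather than a raw distance, lets a standard cross-entropy loss be applied during training and gives each match a confidence that can be thresholded.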
