TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images

Zhuoyu Cai; Dou Quan; Ning Huyan; Pei He; Shuang Wang; Licheng Jiao

arXiv:2605.12064 · cs.CV · May 13, 2026

TL;DR

This paper introduces TAR, a framework that uses text semantic priors to improve cross-modal registration of optical and SAR images, especially under large geometric deformations.

Contribution

TAR integrates text semantic priors with visual features to enhance cross-modal registration, addressing appearance discrepancies and complex spatial transformations.

Findings

01. TAR outperforms state-of-the-art methods in cross-modal registration accuracy.

02. The framework achieves significant improvements under large geometric deformations.

03. Experimental results validate the effectiveness of text-assisted feature enhancement.

Abstract

Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from…
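
The abstract is cut off before it describes how the CFDM module works, so the following is only a minimal sketch of the general coarse-to-fine dense matching idea, not the paper's method. It assumes descriptors are plain Python lists, that coarse descriptors come from sum-pooling fine ones, and that similarity is cosine similarity; the names `pool`, `best_match`, and `coarse_to_fine_match` are hypothetical. Each coarse cell in image A is matched to its most similar coarse cell in image B, and each fine position is then matched only inside that cell.

```python
import math


def cosine(a, b):
    """Cosine similarity between two descriptor vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def pool(fine, scale):
    """Sum the fine descriptors inside each scale x scale cell (assumed pooling)."""
    coarse = {}
    for (r, c), desc in fine.items():
        key = (r // scale, c // scale)
        if key in coarse:
            coarse[key] = [x + y for x, y in zip(coarse[key], desc)]
        else:
            coarse[key] = list(desc)
    return coarse


def best_match(query, candidates):
    """Return the position in `candidates` whose descriptor is most similar to `query`."""
    return max(candidates, key=lambda pos: cosine(query, candidates[pos]))


def coarse_to_fine_match(fine_a, fine_b, scale=2):
    """Map every fine position in A to a fine position in B.

    Stage 1 matches coarse cells; stage 2 searches only inside the matched cell.
    """
    coarse_a, coarse_b = pool(fine_a, scale), pool(fine_b, scale)
    matches = {}
    for (ra, ca), desc_a in coarse_a.items():
        rb, cb = best_match(desc_a, coarse_b)
        # Fine candidates in B are restricted to the matched coarse cell.
        window_b = {
            (rb * scale + dr, cb * scale + dc): fine_b[(rb * scale + dr, cb * scale + dc)]
            for dr in range(scale)
            for dc in range(scale)
        }
        for dr in range(scale):
            for dc in range(scale):
                pos_a = (ra * scale + dr, ca * scale + dc)
                matches[pos_a] = best_match(fine_a[pos_a], window_b)
    return matches


# Toy data: a 4x4 grid of one-hot descriptors; B is A shifted two columns (cyclically).
size = 4
fine_a = {
    (r, c): [1.0 if i == r * size + c else 0.0 for i in range(size * size)]
    for r in range(size)
    for c in range(size)
}
fine_b = {(r, (c + 2) % size): desc for (r, c), desc in fine_a.items()}
matches = coarse_to_fine_match(fine_a, fine_b)
```

On this toy input every position `(r, c)` in A should map to `(r, (c + 2) % 4)` in B. Restricting the fine search to the matched coarse cell is what keeps the dense stage cheap; a learned model would replace the one-hot descriptors with network features and would typically regress sub-pixel offsets rather than pick a discrete cell.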
