OptiSAR-Net++: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding

Xiaoyu Tang, Jun Dong, Jintao Cheng, and Rui Fan

arXiv:2603.24876·cs.CV·March 27, 2026

Open Access

TL;DR

This paper introduces OptiSAR-Net++, a Transformer-free framework for cross-domain (optical and SAR) remote sensing visual grounding, together with OptSAR-RSVG, a new large-scale benchmark for the task. The framework combines efficient cross-domain feature decoupling with CLIP-based cross-modal matching to improve localization accuracy and computational efficiency.

Contribution

The paper presents OptSAR-RSVG, the first large-scale cross-domain remote sensing visual grounding dataset, and proposes OptiSAR-Net++, a Transformer-free framework with modules for enhanced semantic and spatial modeling.

Findings

01. Achieves state-of-the-art performance on the OptSAR-RSVG and DIOR-RSVG benchmarks.

02. Demonstrates improved localization accuracy and efficiency over existing methods.

03. Effectively handles cross-domain feature modeling and fine-grained semantic discrimination.

Abstract

Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic…
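The CLIP-based contrastive paradigm mentioned in the abstract can be illustrated with a minimal sketch. This is a generic CLIP-style matching step, not the authors' implementation: it assumes the text expression and each candidate image region have already been embedded into a shared space, and it selects the region whose embedding has the highest cosine similarity to the text. The function names (`l2_normalize`, `cosine_scores`, `ground`), the toy embeddings, and the box format are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(text_emb, region_embs):
    """Cosine similarity between one text embedding and each region embedding."""
    t = l2_normalize(text_emb)
    return [
        sum(a * b for a, b in zip(t, l2_normalize(r)))
        for r in region_embs
    ]


def ground(text_emb, region_embs, boxes):
    """Return the candidate box whose embedding best matches the text, plus its score."""
    scores = cosine_scores(text_emb, region_embs)
    best = max(range(len(scores)), key=scores.__getitem__)
    return boxes[best], scores[best]


# Toy 2-D embeddings (hypothetical values, for illustration only).
text_emb = [1.0, 0.0]
region_embs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (8, 8, 12, 12)]  # (x1, y1, x2, y2)

best_box, best_score = ground(text_emb, region_embs, boxes)
print(best_box, round(best_score, 3))
```

Matching by similarity in this way avoids an autoregressive or query-based Transformer decoder, which is the kind of overhead the abstract says the framework aims to reduce; how OptiSAR-Net++ actually produces and scores candidates is not specified in the truncated abstract.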

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications