Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
Yuxuan Li, Yuming Chen, Yunheng Li, Ming-Ming Cheng, Xiang Li, Jian Yang

TL;DR
This paper introduces BabelRS, a pretraining framework that uses language as a semantic pivot to improve multi-modal remote sensing object detection, addressing existing challenges in modality alignment and training stability.
Contribution
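BabelRS uses language as a semantic pivot to explicitly decouple modality alignment from downstream task learning. It introduces two components, Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA), aimed at improving training stability and detection performance.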
Findings
BabelRS outperforms state-of-the-art methods on multi-modal remote sensing detection tasks.
Training is more stable than with existing late-alignment approaches.
Language serves as an effective bridge between heterogeneous sensor modalities.
Abstract
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further…
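The idea of using language as a pivot can be illustrated with a small toy example. The sketch below is not the paper's CSIA implementation, whose details are not given in the abstract. It is a generic contrastive-style alignment, assuming that each sensor's visual features are scored against one shared table of concept text embeddings. The function names, the temperature value, and the random toy embeddings are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def concept_alignment_loss(visual_feats, concept_embeds, labels, temperature=0.07):
    """Cross-entropy between visual features and a shared set of concept embeddings.

    visual_feats:   (N, D) features from one sensor modality (assumed encoder output)
    concept_embeds: (K, D) text embeddings of K shared concepts (the language pivot)
    labels:         (N,)   index of the concept each feature should match
    """
    v = l2_normalize(visual_feats)
    c = l2_normalize(concept_embeds)
    logits = v @ c.T / temperature                       # (N, K) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

rng = np.random.default_rng(0)

# Hypothetical embeddings for three shared concepts, e.g. "ship", "plane", "vehicle".
concepts = rng.normal(size=(3, 16))
labels = np.array([0, 1, 2, 1])

# Toy features for two modalities; both are compared against the SAME concept table.
sar_feats = concepts[labels] + 0.01 * rng.normal(size=(4, 16))
ir_feats = concepts[labels] + 0.01 * rng.normal(size=(4, 16))

# Features that sit on the wrong concept, for comparison.
misaligned_feats = concepts[(labels + 1) % 3]

loss_sar = concept_alignment_loss(sar_feats, concepts, labels)
loss_ir = concept_alignment_loss(ir_feats, concepts, labels)
loss_misaligned = concept_alignment_loss(misaligned_feats, concepts, labels)

print(f"SAR loss: {loss_sar:.4f}, IR loss: {loss_ir:.4f}, misaligned loss: {loss_misaligned:.4f}")
```

Because both modalities are pulled toward the same concept embeddings, they end up close to each other without ever being compared directly, which is the sense in which language acts as a pivot in this sketch.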
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
