A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization

LinFeng Li; Jian Zhao; Zepeng Yang; Yuhang Song; Bojun Lin; Tianle Zhang; Yuchen Yuan; Chi Zhang; Xuelong Li

arXiv:2510.20291 · cs.CV · October 24, 2025

TL;DR

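This paper introduces a parameter-efficient mixture-of-experts framework for cross-modal geo-localization. It addresses heterogeneity across satellite, drone, and ground platforms and the domain gap between generic training captions and platform-specific test queries, and it is a winning solution to RoboSense 2025 Track 4 (Cross-Modal Drone Navigation).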

Contribution

The paper proposes a platform-specific MoE framework, with one expert each for satellite, drone, and ground imagery, combined with domain-aligned preprocessing and LLM-based caption refinement to improve text-to-image geo-localization across these platforms.

Findings

01. Achieved top performance in RoboSense 2025 Track 4

02. Enhanced discriminative power with a three-expert fusion strategy

03. Robust geo-localization across heterogeneous viewpoints

Abstract

We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task retrieves the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power, and fuse their scores…
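The abstract is cut off before it describes how the three platform experts' scores are fused, so the following is only a minimal sketch of one plausible scheme, not the authors' method. It assumes each expert (satellite, drone, ground) has already produced a text-to-image similarity score for every gallery image, and it combines them with a normalized weighted average before ranking. The function names `fuse_scores` and `rank_gallery`, the weights, and the example scores are all hypothetical.

```python
from typing import Dict, List


def fuse_scores(expert_scores: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Fuse per-expert similarity scores with a normalized weighted average.

    Assumption: weighted averaging is used here for illustration only; the
    source does not state the fusion rule.
    """
    lengths = {len(scores) for scores in expert_scores.values()}
    if len(lengths) != 1:
        raise ValueError("every expert must score the same gallery")
    total_weight = sum(weights[name] for name in expert_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    fused = [0.0] * lengths.pop()
    for name, scores in expert_scores.items():
        w = weights[name] / total_weight  # normalize so weights sum to 1
        for i, score in enumerate(scores):
            fused[i] += w * score
    return fused


def rank_gallery(fused: List[float]) -> List[int]:
    """Return gallery indices sorted from most to least similar."""
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Hypothetical similarity scores of one text query against three gallery images.
scores = {
    "satellite": [0.2, 0.9, 0.1],
    "drone": [0.3, 0.6, 0.4],
    "ground": [0.1, 0.5, 0.8],
}
equal_weights = {"satellite": 1.0, "drone": 1.0, "ground": 1.0}

fused = fuse_scores(scores, equal_weights)
ranking = rank_gallery(fused)
```

With equal weights, an image that all three experts score moderately high can outrank one that a single expert scores very high. Unequal weights would let a platform-specific query lean on the matching expert, though whether the paper does this is not stated in the visible text.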

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization