A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
LinFeng Li, Jian Zhao, Zepeng Yang, Yuhang Song, Bojun Lin, Tianle Zhang, Yuchen Yuan, Chi Zhang, Xuelong Li

TL;DR
This paper presents a parameter-efficient mixture-of-experts framework for cross-modal geo-localization that addresses inter-platform heterogeneity and the domain gap between training descriptions and test queries. It is a winning solution to RoboSense 2025 Track 4 (Cross-Modal Drone Navigation).
Contribution
It proposes a platform-specific MoE framework with domain-aligned preprocessing and caption refinement, improving cross-modal geo-localization accuracy across diverse platforms.
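The abstract lists removal of orientation words as one of the preprocessing steps. A minimal sketch of such a filter follows. The word list and the function name `strip_orientation_words` are hypothetical, since the source does not say which terms were removed or how.

```python
import re

# Hypothetical vocabulary: the source does not enumerate the removed terms.
ORIENTATION_WORDS = [
    "left", "right", "top", "bottom", "upper", "lower",
    "north", "south", "east", "west",
]

# Match whole words only, so "rooftop" is not affected by "top".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(ORIENTATION_WORDS) + r")\b", re.IGNORECASE
)


def strip_orientation_words(caption: str) -> str:
    """Remove orientation terms and collapse the leftover whitespace."""
    cleaned = _PATTERN.sub("", caption)
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_orientation_words("A road on the left side of the building"))
```

A word-level filter like this leaves stray punctuation when the removed term sits next to a comma, so a production pipeline would likely need a more careful rewrite step.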
Findings
Achieved top performance in RoboSense 2025 Track 4
Enhanced discriminative power with a three-expert fusion strategy
Robust geo-localization across heterogeneous viewpoints
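The three-expert fusion can be illustrated with a weighted sum of per-expert similarity scores. This is only a sketch under stated assumptions: the source does not specify the fusion rule or the weights, and the code assumes every expert scores every gallery item for a given query. The names `fuse_scores` and `rank_gallery` are hypothetical.

```python
from typing import List, Sequence


def fuse_scores(
    expert_scores: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted sum of per-expert scores over the same gallery (assumed rule)."""
    if len(expert_scores) != len(weights):
        raise ValueError("one weight per expert is required")
    n_items = len(expert_scores[0])
    if any(len(scores) != n_items for scores in expert_scores):
        raise ValueError("all experts must score the same gallery")
    return [
        sum(w * scores[i] for w, scores in zip(weights, expert_scores))
        for i in range(n_items)
    ]


def rank_gallery(fused: Sequence[float]) -> List[int]:
    """Return gallery indices sorted from most to least similar."""
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Toy scores for one query against a three-image gallery.
satellite = [0.9, 0.1, 0.2]
drone = [0.2, 0.8, 0.1]
ground = [0.3, 0.7, 0.2]
fused = fuse_scores([satellite, drone, ground], [1 / 3, 1 / 3, 1 / 3])
print(rank_gallery(fused))
```

Equal weights are used here purely for illustration. In practice the weights would be tuned on validation data.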
Abstract
We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task retrieves the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. With BGE-M3 (text) and EVA-CLIP (image) as encoders, we train three platform experts using a progressive two-stage hard-negative mining strategy to enhance discriminative power, and fuse their scores…
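The hard-negative mining step mentioned in the abstract can be sketched as selecting, for each text query, the non-matching gallery items that the current model scores highest. The code below shows only this generic selection step, using cosine similarity over toy embeddings. It does not reproduce the authors' progressive two-stage schedule, which the abstract does not detail, and `mine_hard_negatives` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mine_hard_negatives(
    query: Sequence[float],
    gallery: Sequence[Sequence[float]],
    positive_idx: int,
    k: int,
) -> List[int]:
    """Return indices of the k most similar gallery items that are not the positive."""
    scored = [
        (cosine(query, emb), idx)
        for idx, emb in enumerate(gallery)
        if idx != positive_idx
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-D embeddings: item 0 is the true match, item 1 is a near miss.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(mine_hard_negatives([1.0, 0.0], gallery, positive_idx=0, k=2))
```

The mined indices would then serve as negatives in a contrastive loss for the next training round.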
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
