# MS2-CL: Multi-Scale Self-Supervised Learning for Camera to LiDAR Cross-Modal Place Recognition

**Authors:** Wen Liu, Lei Ma, Xuanshun Zhuang, Zhongliang Deng

PMC · DOI: 10.3390/s26051561 · Sensors (Basel, Switzerland) · 2026-03-02

## TL;DR

This paper introduces MS2-CL, a cross-modal place recognition method that localizes camera images within LiDAR point cloud maps by learning a scale-invariant, unified embedding space from 2D representations of both modalities.

## Contribution

A multi-scale self-distillation paradigm, in which coarse-scale "teacher" features supervise fine-scale "student" features within each modality, enables scale-invariant feature learning for cross-modal place recognition.

## Key findings

- MS2-CL achieves state-of-the-art performance on the KITTI and KITTI-360 datasets.
- Using only the KITTI-trained model without fine-tuning, Recall@1 exceeds 60% on all evaluable KITTI-360 sequences at a 10 m threshold.
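
To make the Recall@1 figure concrete, the following is a minimal sketch of how such a metric is commonly computed in place recognition. It is not the authors' evaluation code. The function name `recall_at_1`, the use of cosine similarity for retrieval, and the toy data are all assumptions for illustration; only the 10 m threshold comes from the summary above.

```python
import numpy as np

def recall_at_1(query_emb, db_emb, query_pos, db_pos, threshold_m=10.0):
    """Fraction of queries whose top-1 retrieved map entry lies within threshold_m.

    Hypothetical helper: retrieval uses cosine similarity, which is an assumption.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    # Index of the most similar database (LiDAR map) entry for each camera query.
    top1 = np.argmax(q @ d.T, axis=1)
    # A retrieval counts as correct if its position is close enough to the query.
    dist = np.linalg.norm(query_pos - db_pos[top1], axis=1)
    return float(np.mean(dist <= threshold_m))

# Toy data (made up): three map entries and three queries in a 2D embedding space.
db_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
db_pos = np.array([[0.0, 0.0], [50.0, 0.0], [100.0, 0.0]])
query_emb = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])
query_pos = np.array([[3.0, 0.0], [52.0, 4.0], [98.0, 0.0]])

score = recall_at_1(query_emb, db_emb, query_pos, db_pos)
print(f"Recall@1 at 10 m: {score:.3f}")
```

In this toy case the third query retrieves a map entry roughly 98 m away, so two of three queries count as correct.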

## Abstract

Place recognition is a fundamental challenge for robotics and autonomous vehicles. While visual place recognition has achieved high precision, cross-modal place recognition—specifically, visual localization within large-scale point cloud maps—remains a formidable problem. Existing methods often struggle with the significant domain gap between modalities and can be computationally prohibitive, especially those processing raw 3D point clouds. Furthermore, they frequently fail to learn features invariant to viewpoint and scale variations, limiting generalization to unseen environments. In this paper, we formulate cross-modal recognition as a problem of learning a scale-invariant, unified embedding space. Our framework employs a hierarchical Swin Transformer to extract multi-scale features from unified 2D representations of both modalities. The central principle of our method is a multi-scale self-distillation paradigm, which recasts feature learning as an intra-modal knowledge transfer task. Specifically, the coarse-scale “teacher” features provide supervision for the fine-scale “student” features. The final inter-modal alignment is then achieved via a global contrastive loss, exclusively leveraging the semantically rich “teacher” embeddings to ensure a reliable and discriminative matching. Extensive experiments on the KITTI and KITTI-360 datasets demonstrate that our method achieves state-of-the-art performance. Notably, using only the KITTI-trained model without fine-tuning, Recall@1 exceeds 60% on all evaluable sequences of KITTI-360 at a 10 m threshold. Code and pre-trained models will be made publicly available upon acceptance.
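
The training objective described in the abstract can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the paper's formulation: the abstract does not give the exact loss functions, so the cosine-based distillation term, the symmetric InfoNCE form of the contrastive loss, the `temperature` value, and the `distill_weight` parameter are all hypothetical choices. What is taken from the abstract is the structure: coarse-scale "teacher" embeddings supervise fine-scale "student" embeddings within each modality, and only the teacher embeddings enter the inter-modal contrastive term.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def distill_loss(student, teacher):
    """Intra-modal term (assumed form): mean of 1 - cosine(student, teacher).

    In an autograd framework the teacher would typically be detached so that
    only the student is updated by this term; that is also an assumption.
    """
    s, t = l2_normalize(student), l2_normalize(teacher)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

def log_softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def info_nce(cam, lidar, temperature=0.07):
    """Inter-modal term (assumed form): symmetric InfoNCE over a batch.

    Row i of `cam` and row i of `lidar` are treated as the same place.
    """
    logits = l2_normalize(cam) @ l2_normalize(lidar).T / temperature
    idx = np.arange(len(cam))
    cam_to_lidar = -log_softmax(logits)[idx, idx].mean()
    lidar_to_cam = -log_softmax(logits.T)[idx, idx].mean()
    return float(0.5 * (cam_to_lidar + lidar_to_cam))

def total_loss(cam_student, cam_teacher, lidar_student, lidar_teacher,
               distill_weight=1.0):
    # Self-distillation within each modality, contrastive alignment on teachers only.
    intra = (distill_loss(cam_student, cam_teacher)
             + distill_loss(lidar_student, lidar_teacher))
    inter = info_nce(cam_teacher, lidar_teacher)
    return inter + distill_weight * intra

# Toy batch (random, for illustration only): 4 places, 8-dimensional embeddings.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = teacher + 0.1 * rng.normal(size=(4, 8))

aligned = info_nce(teacher, teacher)         # pairs match row by row
shuffled = info_nce(teacher, teacher[::-1])  # pairs deliberately mismatched
loss = total_loss(student, teacher, student, teacher)
print(aligned, shuffled, loss)
```

The contrastive term is lower when matching rows correspond to the same place than when the pairing is shuffled, which is the behavior the alignment step relies on.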

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12987093/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12987093/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/PMC12987093/full.md

---
Source: https://tomesphere.com/paper/PMC12987093