Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens
Ciem Cornelissen, Sam Leroux, Pieter Simoens

TL;DR
Le MuMo JEPA is a self-supervised multi-modal learning framework that fuses RGB images with an aligned companion modality, such as LiDAR depth or thermal imagery, to learn stronger visual representations for autonomous driving tasks.
Contribution
The paper extends LeJEPA to multi-modal data by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer, improving both efficiency and downstream performance.
Findings
Outperforms baselines on the Waymo, nuScenes, and Teledyne FLIR ADAS benchmarks.
Achieves better detection, depth estimation, and segmentation results at lower compute and memory cost.
Maintains a strong accuracy-efficiency balance across tasks.
Abstract
Self-supervised learning has emerged as a powerful paradigm for learning visual representations without manual annotations, yet most methods still operate on a single modality and therefore miss the complementary structure available from heterogeneous sensors. We present Le MuMo JEPA, a self-supervised framework that learns unified representations from RGB images and aligned companion modalities. In our driving experiments, the second modality is camera-aligned LiDAR depth; we also evaluate RGB-thermal training and transfer on the Teledyne FLIR ADAS benchmark. Our approach extends LeJEPA to the multi-modal setting by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer. Our default model employs a pruned fusion strategy: after an initial cross-modal attention layer, modality-specific tokens are dropped, forcing…
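The pruned fusion strategy described in the abstract can be illustrated with a toy token-routing sketch. This is not the authors' implementation: the function names (`attention`, `pruned_fusion`, `random_tokens`), the token counts, the identity attention projections, and the choice to run the later layers on fusion tokens only are all assumptions made for illustration. The abstract is truncated before it states what happens after pruning.

```python
import math
import random


def attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (illustrative only)."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * value[i] for w, value in zip(weights, tokens))
                    for i in range(dim)])
    return out


def pruned_fusion(rgb_tokens, depth_tokens, fusion_tokens, later_layers=2):
    """Hypothetical pruned-fusion forward pass over lists of token vectors."""
    num_fusion = len(fusion_tokens)
    # Initial cross-modal layer: fusion tokens attend jointly with both modalities.
    sequence = attention(fusion_tokens + rgb_tokens + depth_tokens)
    # Prune: drop the modality-specific tokens and keep only the fusion tokens.
    sequence = sequence[:num_fusion]
    # Assumption: the remaining layers see the fusion tokens alone.
    for _ in range(later_layers):
        sequence = attention(sequence)
    return sequence


def random_tokens(rng, count, dim):
    """Stand-in for a patch stem: random vectors in place of learned embeddings."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
rgb = random_tokens(rng, 16, 8)     # 16 RGB patch tokens
depth = random_tokens(rng, 16, 8)   # 16 companion-modality patch tokens
fusion = random_tokens(rng, 4, 8)   # 4 fusion tokens (learnable in a real model)
fused = pruned_fusion(rgb, depth, fusion)
print(len(fused), len(fused[0]))  # 4 8
```

In this toy setup the first layer scores 36 x 36 token pairs, while each later layer scores only 4 x 4. That gap is one plausible reading of why a bottleneck of this kind can reduce compute and memory.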
