CrossJEPA: Cross-Modal Joint-Embedding Predictive Architecture for Efficient 3D Representation Learning from 2D Images
Avishka Perera, Kumal Hewagamage, Saeedha Nazar, Kavishka Abeywardana, Hasitha Gallella, Ranga Rodrigo, Mohamed Afham

TL;DR
CrossJEPA introduces a cross-modal joint-embedding predictive architecture that efficiently learns 3D representations from 2D images, outperforming previous methods in accuracy, efficiency, and resource usage.
Contribution
It proposes a novel cross-modal JEPA framework that leverages a frozen teacher model and a predictor to improve 3D representation learning from 2D data, with state-of-the-art results.
Findings
Achieves 94.2% on ModelNet40 in linear probing
Attains 88.3% on ScanObjectNN in linear probing
Uses only 14.1M pretraining parameters and 6 hours of training
Abstract
Image-to-point cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. However, current methods that leverage 2D data often result in large, slow-to-train models, making them computationally expensive and difficult to deploy in resource-constrained environments. The architecture design of such models is therefore critical, determining their performance, memory footprint, and compute efficiency. The Joint-embedding Predictive Architecture (JEPA) has gained wide popularity in self-supervised learning for its simplicity and efficiency, but has been under-explored in cross-modal settings, partly due to the misconception that masking is intrinsic to JEPA. In this light, we propose CrossJEPA, a simple Cross-modal Joint Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
Topics3D Shape Modeling and Analysis · Human Pose and Action Recognition · Multimodal Machine Learning Applications
