CrossJEPA: Cross-Modal Joint-Embedding Predictive Architecture for Efficient 3D Representation Learning from 2D Images

Avishka Perera; Kumal Hewagamage; Saeedha Nazar; Kavishka Abeywardana; Hasitha Gallella; Ranga Rodrigo; Mohamed Afham

arXiv:2511.18424·cs.CV·November 25, 2025

Open Access

TL;DR

CrossJEPA is a cross-modal joint-embedding predictive architecture that learns 3D representations from 2D images. It reports 94.2% on ModelNet40 and 88.3% on ScanObjectNN in linear probing while using only 14.1M pretraining parameters and 6 hours of training.

Contribution

The paper proposes a cross-modal JEPA framework that pairs a frozen teacher model with a trained predictor to learn 3D representations from 2D data, and it reports state-of-the-art results.
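The frozen-teacher-plus-predictor idea can be illustrated with a toy training loop. This is a minimal sketch under stated assumptions, not the authors' implementation: every network is replaced by a small linear map, the "image" features are produced by a hypothetical `render_W` map, and the mean-squared-error loss and all names (`train_step`, `run_demo`, `student_W`, `predictor_W`, `teacher_W`) are illustrative choices that the page does not specify. The only point it shows is that the teacher supplies targets in embedding space and receives no updates, while the student encoder and predictor are trained.

```python
import copy
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.5):
    """Random matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def train_step(student_W, predictor_W, teacher_W, points, image, lr):
    """One update: regress the frozen teacher's image embedding from point features."""
    target = matvec(teacher_W, image)   # frozen teacher output, treated as a constant
    z = matvec(student_W, points)       # student embedding of the point features
    pred = matvec(predictor_W, z)       # predictor output in the teacher's space
    d = len(pred)
    loss = sum((p - t) ** 2 for p, t in zip(pred, target)) / d

    # Manual gradients of the mean-squared error (both trainable maps are linear).
    g_pred = [2.0 * (p - t) / d for p, t in zip(pred, target)]
    g_z = [sum(g_pred[i] * predictor_W[i][j] for i in range(d)) for j in range(len(z))]

    # Only the student and the predictor are updated; teacher_W is never modified.
    for i in range(d):
        for j in range(len(z)):
            predictor_W[i][j] -= lr * g_pred[i] * z[j]
    for j in range(len(z)):
        for k in range(len(points)):
            student_W[j][k] -= lr * g_z[j] * points[k]
    return loss


def run_demo(epochs=200, lr=0.05, seed=0):
    """Train on toy paired data; return (first-epoch loss, last-epoch loss, teacher unchanged)."""
    rng = random.Random(seed)
    render_W = rand_matrix(8, 6, rng)     # hypothetical stand-in for points -> image features
    teacher_W = rand_matrix(4, 8, rng)    # stand-in for the frozen image model
    student_W = rand_matrix(5, 6, rng)    # stand-in for the 3D encoder
    predictor_W = rand_matrix(4, 5, rng)  # stand-in for the predictor
    teacher_before = copy.deepcopy(teacher_W)

    # Toy paired data: each "image" is a fixed linear function of its "points".
    pairs = []
    for _ in range(32):
        points = [rng.uniform(-1.0, 1.0) for _ in range(6)]
        pairs.append((points, matvec(render_W, points)))

    epoch_losses = []
    for _ in range(epochs):
        total = 0.0
        for points, image in pairs:
            total += train_step(student_W, predictor_W, teacher_W, points, image, lr)
        epoch_losses.append(total / len(pairs))
    return epoch_losses[0], epoch_losses[-1], teacher_W == teacher_before


if __name__ == "__main__":
    first, last, teacher_frozen = run_demo()
    print(f"mean loss, first epoch: {first:.4f}")
    print(f"mean loss, last epoch:  {last:.4f}")
    print(f"teacher unchanged: {teacher_frozen}")
```

In a real setting the linear maps would be replaced by a point cloud encoder, a predictor network, and a pretrained image model whose parameters are excluded from the optimizer.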

Findings

01 Achieves 94.2% on ModelNet40 in linear probing

02 Attains 88.3% on ScanObjectNN in linear probing

03 Uses only 14.1M pretraining parameters and 6 hours of training

Abstract

Image-to-point cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. However, current methods that leverage 2D data often result in large, slow-to-train models, making them computationally expensive and difficult to deploy in resource-constrained environments. The architecture design of such models is therefore critical, determining their performance, memory footprint, and compute efficiency. The Joint-embedding Predictive Architecture (JEPA) has gained wide popularity in self-supervised learning for its simplicity and efficiency, but has been under-explored in cross-modal settings, partly due to the misconception that masking is intrinsic to JEPA. In this light, we propose CrossJEPA, a simple Cross-modal Joint Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to…

Taxonomy

Topics: 3D Shape Modeling and Analysis · Human Pose and Action Recognition · Multimodal Machine Learning Applications