VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Delong Chen; Mustafa Shukor; Theo Moutakanni; Willy Chung; Jade Yu; Tejaswi Kasarla; Yejin Bang; Allen Bolourchi; Yann LeCun; Pascale Fung

arXiv:2512.10942·cs.CV·February 3, 2026

Open Access · 3 Reviews

TL;DR

VL-JEPA introduces a vision-language model that predicts continuous text embeddings in an abstract space, achieving better performance with fewer parameters and supporting versatile tasks like classification, retrieval, and VQA.

Contribution

The paper presents VL-JEPA, a novel vision-language model that predicts embeddings instead of tokens, improving efficiency and versatility over traditional token-based models.
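
One way to see why predicting embeddings instead of tokens suits classification and retrieval: a predicted embedding can be compared directly with the embeddings of candidate texts, with no text generation step. The sketch below is a toy illustration, not the authors' implementation. The function names `cosine` and `classify`, the three-dimensional vectors, and the label set are all hypothetical and chosen only for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(predicted, label_embeddings):
    """Return the label whose text embedding is closest to the prediction."""
    return max(
        label_embeddings,
        key=lambda name: cosine(predicted, label_embeddings[name]),
    )


# Hypothetical text embeddings for three candidate labels (made-up values).
label_embeddings = {
    "cooking": [0.9, 0.1, 0.0],
    "driving": [0.0, 0.8, 0.6],
    "swimming": [0.1, 0.0, 0.95],
}

# Hypothetical embedding predicted by the model for a video clip.
predicted = [0.8, 0.2, 0.1]

best = classify(predicted, label_embeddings)
print(best)
```

Retrieval follows the same pattern under these assumptions: rank the candidates by similarity instead of keeping only the best one.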

Findings

01 Outperforms standard token-space VLM training with 50% fewer trainable parameters, using the same vision encoder and training data.

02 Supports selective decoding, reducing decoding operations by 2.85x.

03 Achieves state-of-the-art results on multiple video classification and retrieval datasets.

Abstract

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to…
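
The selective decoding idea can be sketched as follows: watch the stream of predicted embeddings and call the text decoder only when the current embedding has drifted far enough from the last one that was decoded. This is a minimal sketch under stated assumptions, not the paper's procedure. The cosine-distance trigger, the `threshold` value, the function names, and the toy `decode` function are all assumptions introduced for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, for non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def selective_decode(stream, decode, threshold=0.2):
    """Decode only when the embedding moves away from the last decoded one.

    `threshold` is an assumed hyperparameter, not a value from the paper.
    Returns a list of (step, decoded_text) pairs.
    """
    events = []
    last_decoded = None
    for step, emb in enumerate(stream):
        if last_decoded is None or cosine_distance(emb, last_decoded) > threshold:
            events.append((step, decode(emb)))
            last_decoded = emb
    return events


def decode(emb):
    """Toy stand-in for a text decoder: names the dominant dimension."""
    return "scene A" if emb[0] >= emb[1] else "scene B"


# Hypothetical embedding stream: three similar frames, then a scene change.
stream = [
    [1.00, 0.00],
    [0.99, 0.05],
    [0.98, 0.10],
    [0.00, 1.00],
    [0.05, 0.99],
]

events = selective_decode(stream, decode)
print(events)
print(f"decoder calls: {len(events)} of {len(stream)} steps")
```

In this toy stream the decoder runs only at the first step and at the scene change, which is the kind of saving the abstract describes. The actual trigger criterion and the 2.85x figure come from the paper's own experiments, not from this sketch.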

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper's core contribution, applying a predictive JEPA-style objective to the cross-modal VL problem, is highly novel. Moving VL learning from the discrete token space to a continuous semantic space is a well-motivated and promising direction to address the known efficiency and latency bottlenecks of standard generative VLMs.
2. The "selective decoding" mechanism (Sec 4.3) is a standout contribution. The idea of monitoring the latent embedding stream for semantic variance and only triggering

Weaknesses

1. While the paper emphasizes that the shift from token-space to embedding-space simplifies the target distribution, it provides no rigorous analysis of the measurability and discriminability of the resulting semantic embedding space. To substantiate this claim, the authors should provide supplementary analysis, such as: (i) Visualization or quantitative studies on the embedding space's structure (e.g., its clustering properties, separability) to demonstrate this claimed simplification. (ii)

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1 - The paper is clearly written and easy to follow.
2 - A new vision-language model leveraging the JEPA architecture instead of regular transformer decoders.
3 - Comparable performance to existing transformer-based VLMs, with fewer parameters.
4 - Extensive details are given about the training setup and resources.

Weaknesses

1 - The model seems to be focused on video understanding, as most of the training data relates to this task. This raises questions about the comparison to other VLMs that are trained and designed to be more generalist.
2 - Experiments focus on only a subset of the use cases of a vision-language model (video understanding). More experiments on other types of tasks would have been appreciated (e.g., MMMU, OCRBench, DocVQA, etc.). If the JEPA architecture is intended to replace transformer-based

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The paper introduces (or successfully reapplies) the JEPA architecture to the VL setting.
- Shows notable gains in both training speed and performance on zero-shot video captioning and classification (Fig. 3) while using a well-argued training procedure (JEPA).
- Shows non-trivial adaptation to retrieval and open-label classification, e.g. on YouCook2 and MSR-VTT (Table 6).
- The paper investigates underexplored areas in the field by exploring alternatives to generative token decoding, resulting

Weaknesses

- The relevant benchmarks used for evaluation are only briefly introduced. I would have loved to see a more substantial justification for choosing these specifically.
- While it is stated that the model and code will be open source, they could be shared through existing anonymous platforms.
- In Table 6, VL-JEPA is compared to contrastively-trained models. It is implicitly argued that this is the reason for the subpar performance on some of the tasks. I would have loved to see a contrastive adaptati

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques