Denoising with a Joint-Embedding Predictive Architecture

Dengsheng Chen; Jie Hu; Xiaoming Wei; Enhua Wu

arXiv:2410.03755·cs.LG·February 5, 2025

Open Access 3 Reviews

TL;DR

This paper introduces D-JEPA, a generative modeling approach that combines a joint-embedding predictive architecture with diffusion and flow matching losses. It achieves state-of-the-art results on ImageNet and shows potential for other data types.

Contribution

It pioneers integrating JEPA into generative modeling, using diffusion and flow matching losses for flexible, scalable data generation across various modalities.
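
To make the two loss families named above concrete, here is a minimal pure-Python sketch of a per-token noise-prediction (diffusion) loss and a per-token flow matching loss. It is not the authors' implementation: the function names, the `eps_model` and `v_model` callables, the conditioning argument `cond`, and the linear interpolation path used for flow matching are all assumptions made for illustration.

```python
import math
import random

def diffusion_loss(token, cond, eps_model, alpha_bar, rng):
    """Noise-prediction loss for one continuous token (illustrative sketch).

    eps_model(noisy, alpha_bar, cond) is a hypothetical predictor that
    returns an estimate of the noise that was mixed into the token.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in token]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    # Forward noising: scale the clean token and add Gaussian noise.
    noisy = [a * x + b * e for x, e in zip(token, eps)]
    pred = eps_model(noisy, alpha_bar, cond)
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(token)

def flow_matching_loss(token, cond, v_model, t, rng):
    """Flow matching loss on a straight noise-to-data path (assumed path).

    v_model(x_t, t, cond) is a hypothetical predictor of the velocity
    that moves the noise sample toward the data token.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in token]
    # Point on the straight line between noise (t=0) and data (t=1).
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, token)]
    # The velocity along that line is constant: data minus noise.
    target = [x - n for x, n in zip(token, noise)]
    pred = v_model(x_t, t, cond)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(token)
```

In both functions, `cond` stands in for whatever per-token conditioning the predictor receives; in this sketch it is simply passed through to the callable.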

Findings

01

D-JEPA outperforms previous models on ImageNet benchmarks.

02

It achieves lower FID scores with fewer training epochs.

03

Models scale well with increased GFLOPs.

Abstract

Joint-embedding predictive architectures (JEPAs) have shown substantial promise in self-supervised representation learning, yet their application in generative modeling remains underexplored. Conversely, diffusion models have demonstrated significant efficacy in modeling arbitrary probability distributions. In this paper, we introduce Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), pioneering the integration of JEPA within generative modeling. By recognizing JEPA as a form of masked image modeling, we reinterpret it as a generalized next-token prediction strategy, facilitating data generation in an auto-regressive manner. Furthermore, we incorporate diffusion loss to model the per-token probability distribution, enabling data generation in a continuous space. We also adapt flow matching loss as an alternative to diffusion loss, thereby enhancing the flexibility of…
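
The "generalized next-token prediction" idea in the abstract can be illustrated with a small set-wise generation loop: all positions start masked, and each step fills in a subset of the remaining positions conditioned on what is already known. This is only a sketch under stated assumptions. The random generation order, the equal-sized subsets, and the `predict_tokens` callable are hypothetical choices, not details taken from the paper.

```python
import random

def generate(num_tokens, num_steps, predict_tokens, rng):
    """Fill a fully masked sequence over several steps (illustrative sketch).

    predict_tokens(tokens, positions) is a hypothetical callable that
    returns one value per requested position, given the partially
    filled sequence (None marks a position that is still masked).
    """
    tokens = [None] * num_tokens
    order = list(range(num_tokens))
    rng.shuffle(order)                       # random order is an assumption
    per_step = -(-num_tokens // num_steps)   # ceiling division
    for start in range(0, num_tokens, per_step):
        positions = order[start:start + per_step]
        values = predict_tokens(tokens, positions)
        for pos, val in zip(positions, values):
            tokens[pos] = val
    return tokens
```

With `num_steps` equal to `num_tokens`, the loop predicts one token at a time in a random order; with `num_steps` set to 1, it predicts every token in a single pass.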

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

(1) This paper introduces a straightforward and effective approach to enhance generation quality by bridging representation learning and image generation. (2) In the appendix, the authors offer an in-depth analysis of the model design, including additional insights on representation learning as well as applications in video and audio generation, showcasing the versatility of the proposed methods across multiple tasks. (3) The presentation is clear, and the ideas are easy to understand.

Weaknesses

(1) The performance is comparable to similar approaches such as MAR, which does not use the JEPA loss. In Table 1, the proposed method requires more training epochs (1400 vs. 800 for D-JEPA-B vs. MAR-B) to achieve results similar to MAR-B. While D-JEPA-L/H outperforms MAR-L/H, it also involves more parameters. Similar trends are observed in Table 2. (2) There is no comparison to baseline methods, such as the effect of removing the JEPA loss. (3) How does the model perform in unconditional generation task

Reviewer 02·Rating 5·Confidence 5

Strengths

1. Sufficient experiments, including generative results on ImageNet-256, sufficient ablation studies, and experiments on representation learning. 2. The authors demonstrate the effectiveness of D-JEPA across multiple modalities, including videos, images, and audio. 3. The authors provide theoretical support for the proposed models. 4. Beyond the generative results, the authors provide an empirical study of the linear performance of the proposed D-JEPA with pixel/latent-level inputs.

Weaknesses

1. The organization and structure of the current version should be improved. The writing of chapter 3 is very confusing to me, and some important results and discussions should be relocated from the appendix to the main paper. 2. The novelty is limited: this work seems to just replace the MAGE parts of MAR, making it a simple combination of I-JEPA and MAR's diffusion parts. It doesn't solve the core issue of the gap between representation learning and generative modeling. Such an archectur

Reviewer 03·Rating 8·Confidence 3

Strengths

* Interesting combination of JEPA representation learning with generative AI, showing that representation learning can help generative AI
* Strong SOTA results on ImageNet 256x256, better than or equal to MAR
* When applied to representation learning, the context encoder achieves good results on ImageNet classification
* Experiments showing generalization to text-conditioned image generation (and not just class-conditioned)
* Experiments showing that it works with audio as well, class-conditioned

Weaknesses

* It would be interesting to have generation results at higher resolution than 256x256 px to see if inference speed suffers when bi-directional attention meets lots of tokens

Taxonomy

Topics·Cell Image Analysis Techniques

Methods·Diffusion