Denoising with a Joint-Embedding Predictive Architecture
Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu

TL;DR
This paper introduces D-JEPA, a generative modeling approach that combines joint-embedding predictive architectures with diffusion and flow matching losses. It achieves state-of-the-art results on ImageNet and shows potential for other data types.
Contribution
It pioneers integrating JEPA into generative modeling, using diffusion and flow matching losses for flexible, scalable data generation across various modalities.
Findings
D-JEPA outperforms previous models on ImageNet benchmarks.
It achieves lower FID scores with fewer training epochs.
Models scale well with increased GFLOPs.
Abstract
Joint-embedding predictive architectures (JEPAs) have shown substantial promise in self-supervised representation learning, yet their application in generative modeling remains underexplored. Conversely, diffusion models have demonstrated significant efficacy in modeling arbitrary probability distributions. In this paper, we introduce Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), pioneering the integration of JEPA within generative modeling. By recognizing JEPA as a form of masked image modeling, we reinterpret it as a generalized next-token prediction strategy, facilitating data generation in an auto-regressive manner. Furthermore, we incorporate diffusion loss to model the per-token probability distribution, enabling data generation in a continuous space. We also adapt flow matching loss as an alternative to diffusion loss, thereby enhancing the flexibility of…
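The two ingredients named in the abstract, masked prediction over tokens and a per-token flow matching loss, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `mask_ratio` parameter, the straight-line interpolation path, and the `predict_velocity` callable are all assumptions made for illustration.

```python
import numpy as np

def split_context_target(n_tokens, mask_ratio, rng):
    """Randomly split token indices into visible context and masked targets.

    Hypothetical helper: the masked tokens play the role of the "next tokens"
    to be predicted from the visible ones.
    """
    n_masked = max(1, int(round(n_tokens * mask_ratio)))
    perm = rng.permutation(n_tokens)
    target_idx = np.sort(perm[:n_masked])
    context_idx = np.sort(perm[n_masked:])
    return context_idx, target_idx

def flow_matching_loss(predict_velocity, x1, z, rng):
    """Per-token flow matching loss on a straight noise-to-data path.

    x1: (n_tokens, d) clean target tokens.
    z:  (n_tokens, c) conditioning features for each token (assumed to come
        from some predictor network, which is not modeled here).
    predict_velocity: callable (x_t, t, z) -> (n_tokens, d), hypothetical.
    """
    n = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise, one sample per token
    t = rng.uniform(size=(n, 1))         # one time value per token in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path
    target = x1 - x0                     # constant velocity of that path
    pred = predict_velocity(x_t, t, z)
    return float(np.mean((pred - target) ** 2))
```

In use, one would call `split_context_target` to choose which tokens to mask, compute conditioning features `z` for the masked positions from the visible ones, and minimize `flow_matching_loss` over the masked tokens only. A diffusion loss would replace the velocity target with a noise-prediction target while keeping the same per-token structure.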
Peer Reviews
Decision: ICLR 2025 Poster
(1) This paper introduces a straightforward and effective approach to enhance generation quality by bridging representation learning and image generation. (2) In the appendix, the authors offer an in-depth analysis of the model design, including additional insights on representation learning as well as applications in video and audio generation, showcasing the versatility of the proposed methods across multiple tasks. (3) The presentation is clear, and the ideas are easy to understand.
(1) The performance is comparable to similar approaches like MAR, which does not use the JEPA loss. In Table 1, D-JEPA-B requires more training epochs than MAR-B (1400 vs. 800) to achieve similar results. While D-JEPA-L/H outperforms MAR-L/H, it also involves more parameters. Similar trends are observed in Table 2. (2) There is no comparison to baseline methods, such as the effect of removing the JEPA loss. (3) How does the model perform in unconditional generation task
1. Sufficient experiments, including generative results on ImageNet-256, sufficient ablation studies, and experiments on representation learning. 2. The authors demonstrate the effectiveness of D-JEPA across multiple modalities, including video, images, and audio. 3. The authors provide theoretical support for the proposed models. 4. Beyond the generative results, the authors provide an empirical study of the linear performance of the proposed D-JEPA with pixel-level and latent-level inputs.
1. The organization and structure of the current version should be improved. The writing of Chapter 3 is very confusing to me, and some important results and discussions should be moved from the appendix into the main paper. 2. The novelty is limited: this work seems to just replace the MAGE part of MAR, making it a simple combination of I-JEPA and MAR's diffusion part. It does not solve the core issue of the gap between representation learning and generative modeling. Such an architecture…
* Interesting combination of JEPA representation learning with generative AI, showing that representation learning can help generative AI
* Strong SOTA results on ImageNet 256x256, better than or equal to MAR
* When applied to representation learning, the context encoder achieves good results on ImageNet classification
* Experiments showing generalization to text-conditioned image generation (and not just class-conditioned)
* Experiments showing that it works with audio as well, class-conditioned
* It would be interesting to have generation results at higher resolution than 256x256 px to see if inference speed suffers when bi-directional attention meets lots of tokens
Taxonomy
Topics: Cell Image Analysis Techniques
Methods: Diffusion
