Improving Joint Embedding Predictive Architecture with Diffusion Noise

Yuping Qiu; Rui Zhu; Ying-cong Chen

arXiv:2507.15216·cs.CV·July 22, 2025

TL;DR

This paper introduces N-JEPA, a novel self-supervised learning method that integrates diffusion noise with masked image modeling to improve image classification performance.

Contribution

The paper proposes an approach that combines diffusion noise with masked image modeling to enhance the representation capacity of self-supervised learning (SSL) for recognition tasks.

Findings

01 Improved downstream classification accuracy

02 Enhanced robustness through multi-level noise scheduling

03 Effective integration of diffusion noise with masked image modeling

Abstract

Self-supervised learning (SSL) has become a highly successful method for feature learning, widely applied to many downstream tasks. It has proven especially effective for discriminative tasks, surpassing the trending generative models. However, generative models perform better in image generation and detail enhancement. Thus, it is natural to seek a connection between SSL and generative models to further enhance the representation capacity of SSL. As generative models can create new samples by approximating the data distribution, such modeling should also lead to a semantic understanding of the raw visual data, which is necessary for recognition tasks. This motivates us to combine the core principle of the diffusion model, diffusion noise, with SSL to learn a competitive recognition model. Specifically, diffusion noise can be viewed as a particular state of mask that reveals a…
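To make the idea of pairing diffusion noise with masking concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract above is truncated and does not say where or how N-JEPA injects noise. The sketch assumes the standard DDPM forward process with a linear beta schedule, applied to the visible patch tokens left after random masking. All function names, the token shape (196 patches, 768 dimensions), the 0.75 mask ratio, and the timestep t=200 are hypothetical choices for illustration.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (DDPM-style)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_diffusion_noise(tokens, t, alpha_bar, rng):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(tokens.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * tokens + np.sqrt(1.0 - a) * eps

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True means the patch is hidden."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()

# Hypothetical patch tokens: 14 x 14 = 196 patches, 768-dim embeddings (assumed).
tokens = rng.standard_normal((196, 768))

# Hide a subset of patches, as in masked image modeling (ratio is assumed).
mask = random_patch_mask(num_patches=196, mask_ratio=0.75, rng=rng)

# Perturb the visible tokens with diffusion noise at an assumed timestep.
noisy_visible = add_diffusion_noise(tokens[~mask], t=200, alpha_bar=alpha_bar, rng=rng)
```

Larger values of `t` shrink `alpha_bar[t]` and push the tokens closer to pure Gaussian noise, so sampling `t` from several ranges is one simple way to expose an encoder to multiple noise levels.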
