DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture
Shentong Mo, Sukmin Yun

TL;DR
DMT-JEPA enhances joint-embedding predictive architecture by introducing discriminative masked targets based on neighboring patches, significantly improving local semantic understanding and performance across multiple visual tasks.
Contribution
We propose DMT-JEPA, a novel masked modeling objective that generates discriminative latent targets from neighboring patches to improve local semantic understanding.
Findings
Improves performance on ImageNet-1K classification
Enhances semantic segmentation accuracy on ADE20K
Boosts object detection results on COCO
Abstract
The joint-embedding predictive architecture (JEPA) has recently shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, which reduces discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch. To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches…
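As a rough illustration of step (a) only, the sketch below computes similarities between one masked patch and its spatial neighbors on a patch grid. The abstract is truncated after this step, so nothing beyond it is shown. The choice of cosine similarity, the square window, the `radius` parameter, and the function names are all assumptions made for illustration; they are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def neighbor_similarities(features, grid_w, masked_idx, radius=1):
    """Map each neighbor index of `masked_idx` to its feature similarity.

    features:   per-patch feature vectors in row-major grid order
    grid_w:     number of patches per row
    masked_idx: flat index of the masked patch
    radius:     half-width of the square neighborhood (hypothetical parameter)
    """
    grid_h = len(features) // grid_w
    r0, c0 = divmod(masked_idx, grid_w)
    sims = {}
    # Visit the square window around the masked patch, clipped to the grid.
    for r in range(max(0, r0 - radius), min(grid_h, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(grid_w, c0 + radius + 1)):
            j = r * grid_w + c
            if j != masked_idx:  # skip the masked patch itself
                sims[j] = cosine(features[masked_idx], features[j])
    return sims

# Toy 2x2 grid of 2-D patch features (made-up values, not from the paper).
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = neighbor_similarities(feats, grid_w=2, masked_idx=0)
```

In this toy grid, patch 1 has the same feature direction as the masked patch 0, patch 2 is orthogonal to it, and patch 3 falls in between, so a similarity-based selection would favor patch 1 as a "semantically similar" neighbor.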
Taxonomy
Topics: Anomaly Detection Techniques and Applications
Methods: Sparse Evolutionary Training
