DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture
Xiangteng He, Shunsuke Sakai, Shivam Chandhok, Sara Beery, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal

TL;DR
DSeq-JEPA introduces a sequential, discriminative approach to self-supervised visual representation learning, inspired by human visual perception, improving the transferability and discriminative power of learned features across various vision tasks.
Contribution
The paper proposes a sequential, discriminative learning architecture that improves self-supervised visual representations by predicting regions in an attention-based order.
Findings
Outperforms I-JEPA variants on multiple vision tasks
Learns more discriminative and generalizable features
Effective across classification, detection, segmentation, and reasoning tasks
Abstract
Recent advances in self-supervised visual representation learning have demonstrated the effectiveness of predictive latent-space objectives for learning transferable features. In particular, Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns representations by predicting latent embeddings of masked target regions from visible context. However, it predicts target regions in parallel and all at once, lacking the ability to order predictions meaningfully. Inspired by human visual perception, which attends selectively and progressively from primary to secondary cues, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges latent predictive and autoregressive self-supervised learning. Specifically, DSeq-JEPA integrates a discriminatively ordered sequential process with a JEPA-style learning objective. This is achieved by (i)…
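The abstract is cut off before the mechanism is described, so the following is only a toy illustration of the contrast it draws between predicting all target regions in parallel and predicting them in a discriminative order. Everything in it is an assumption made for illustration and not the authors' method: the per-region saliency scores, the function names, and in particular the rule that an already-predicted region joins the context for later steps.

```python
from typing import Dict, List, Sequence, Tuple

Step = Tuple[Tuple[str, ...], str]  # (regions available as context, region to predict)


def order_regions_by_saliency(saliency: Dict[str, float]) -> List[str]:
    """Sort region ids from most to least discriminative.

    The scores are hypothetical; ties are broken by region id for determinism.
    """
    return sorted(saliency, key=lambda region: (-saliency[region], region))


def parallel_schedule(context: Sequence[str], targets: Sequence[str]) -> List[Step]:
    """Parallel prediction: every target is predicted from the same visible context."""
    return [(tuple(context), target) for target in targets]


def sequential_schedule(context: Sequence[str], ordered_targets: Sequence[str]) -> List[Step]:
    """Sequential prediction: targets are predicted one at a time, in the given order.

    Assumption: each predicted region is added to the context of later steps.
    """
    steps: List[Step] = []
    available = list(context)
    for target in ordered_targets:
        steps.append((tuple(available), target))
        available.append(target)
    return steps


if __name__ == "__main__":
    # Hypothetical saliency scores for three masked regions of a bird image.
    scores = {"head": 0.9, "wing": 0.6, "background": 0.1}
    order = order_regions_by_saliency(scores)
    for ctx, target in sequential_schedule(["visible"], order):
        print(f"predict {target!r} from context {ctx}")
```

The sketch only builds the prediction schedule; an actual model would replace each step with a predictor network operating on latent embeddings.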
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
