DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

Xiangteng He; Shunsuke Sakai; Shivam Chandhok; Sara Beery; Kun Yuan; Nicolas Padoy; Tatsuhito Hasegawa; Leonid Sigal

arXiv:2511.17354·cs.CV·March 19, 2026

Open Access

TL;DR

DSeq-JEPA introduces a sequential, discriminative approach to self-supervised visual representation learning, inspired by human visual perception, improving the transferability and discriminative power of learned features across various vision tasks.

Contribution

It proposes a sequential, discriminative learning architecture that improves self-supervised visual representations by predicting target regions in an attention-derived order.

Findings

01 Outperforms I-JEPA variants on multiple vision tasks

02 Learns more discriminative and generalizable features

03 Effective across classification, detection, segmentation, and reasoning tasks

Abstract

Recent advances in self-supervised visual representation learning have demonstrated the effectiveness of predictive latent-space objectives for learning transferable features. In particular, the Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns representations by predicting latent embeddings of masked target regions from visible context. However, it predicts all target regions in parallel, at once, and so lacks the ability to order predictions meaningfully. Inspired by human visual perception, which attends selectively and progressively from primary to secondary cues, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges latent predictive and autoregressive self-supervised learning. Specifically, DSeq-JEPA integrates a discriminatively ordered sequential process with a JEPA-style learning objective. This is achieved by (i)…
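The contrast the abstract draws, between predicting every target region at once and predicting regions one at a time in a discriminative order, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated before it states how the order is obtained or how the context is updated, so the per-region `saliency` scores, the mean-pooled context, the teacher-forced context update, and the `predict` callable are all assumptions made purely for illustration.

```python
import numpy as np


def order_regions(saliency):
    """Return region indices sorted from most to least salient.

    `saliency` is a hypothetical per-region score; how DSeq-JEPA
    actually derives its ordering is not given in the abstract.
    """
    scores = np.asarray(saliency, dtype=float)
    return [int(i) for i in np.argsort(-scores, kind="stable")]


def sequential_latent_losses(targets, saliency, context, predict):
    """Predict target-region latents one at a time, in saliency order.

    After each step the true latent of the region just predicted is
    appended to the context (teacher forcing; an assumption). The
    context is summarised by mean pooling (also an assumption).
    Returns the visiting order and the per-step squared-error losses.
    """
    targets = np.asarray(targets, dtype=float)
    ctx = [np.asarray(context, dtype=float)]
    order = order_regions(saliency)
    losses = []
    for idx in order:
        pooled = np.mean(ctx, axis=0)      # summary of everything seen so far
        pred = predict(pooled, idx)        # hypothetical predictor
        losses.append(float(np.mean((pred - targets[idx]) ** 2)))
        ctx.append(targets[idx])           # grow the context for the next step
    return order, losses


if __name__ == "__main__":
    # Three target regions with 2-dimensional latents (toy values).
    toy_targets = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
    toy_saliency = [0.1, 0.9, 0.5]
    toy_context = [0.0, 0.0]

    # Placeholder predictor: echo the pooled context unchanged.
    echo = lambda pooled, idx: pooled

    order, losses = sequential_latent_losses(
        toy_targets, toy_saliency, toy_context, echo
    )
    print(order, losses)
```

In a parallel, I-JEPA-style setup every target would be predicted from the same initial context; here each later prediction also sees the latents of the regions visited before it, which is the autoregressive ingredient the abstract refers to.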

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning