Next-Embedding Prediction Makes Strong Vision Learners

Sihan Xu; Ziqiao Ma; Wenhao Chai; Xuweiyi Chen; Weiyang Jin; Joyce Chai; Saining Xie; Stella X. Yu

arXiv:2512.16922·cs.CV·December 24, 2025

TL;DR

This paper introduces NEPA, a simple yet effective self-supervised vision learning method that trains models to predict future patch embeddings, achieving strong ImageNet and segmentation results without complex auxiliary tasks.

Contribution

Proposes Next-Embedding Predictive Autoregression (NEPA), a novel embedding prediction approach for self-supervised vision learning that simplifies architecture and training.

Findings

01 Achieves 83.8% top-1 accuracy on ImageNet-1K with ViT-B after fine-tuning.

02 Transfers effectively to semantic segmentation on ADE20K.

03 Requires no pixel reconstruction or contrastive loss.

Abstract

Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring…
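
The objective described in the abstract (predict the next patch embedding from the previous ones, under a causal mask, with the target held fixed) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, a causal running average stands in for the Transformer, and negative cosine similarity is an assumed choice of loss.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may look at position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def causal_average(embeddings):
    """Toy stand-in for a causal model (assumption): output i is the mean of
    the embeddings at positions 0..i, so it never sees future patches."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mask = causal_mask(n)
    outputs = []
    for i in range(n):
        visible = [embeddings[j] for j in range(n) if mask[i][j]]
        outputs.append([sum(v[d] for v in visible) / len(visible)
                        for d in range(dim)])
    return outputs

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def next_embedding_loss(predictions, embeddings):
    """Pair the prediction at position t with the embedding at position t + 1.

    The targets are used as plain constants here. In an autodiff framework
    they would be wrapped in a stop-gradient so that no gradient flows
    through the target branch. The loss form (negative mean cosine
    similarity) is an assumption of this sketch.
    """
    pairs = list(zip(predictions[:-1], embeddings[1:]))
    return -sum(cosine(p, z) for p, z in pairs) / len(pairs)

# Three 2-dimensional "patch embeddings" in raster order (made-up values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss = next_embedding_loss(causal_average(patches), patches)
```

In a real training loop the stop-gradient would be applied explicitly, for example with `jax.lax.stop_gradient` in JAX or `Tensor.detach()` in PyTorch, and the running average would be replaced by a Transformer whose attention uses the same lower-triangular mask.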

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis