Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Jinghan Li; Yang Jin; Hao Jiang; Yadong Mu; Yang Song; Kun Xu

arXiv:2512.21004·cs.CV·December 25, 2025

TL;DR

This paper introduces NExT-Vid, an autoregressive visual pretraining framework that uses masked next-frame prediction to jointly model images and videos. It targets the inaccurate semantic localization and poor generation quality of earlier autoregressive methods, yielding better visual representations for downstream tasks.

Contribution

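The paper proposes an autoregressive generative pretraining method for images and videos. A context-isolated autoregressive predictor decouples semantic representation from target decoding, and a conditioned flow-matching decoder handles generation, improving both representation quality and generation diversity.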

Findings

01. Outperforms previous generative pretraining methods in downstream classification tasks.

02. Achieves strong semantic representations through context-isolated flow-matching pretraining.

03. Demonstrates improved generation quality and diversity in video modeling.

Abstract

Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance…
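
The abstract names a conditioned flow-matching decoder but the excerpt cuts off before describing it. As a rough orientation only, the following is a minimal sketch of a generic linear-path flow-matching training target, not the paper's implementation. The function names `flow_matching_pair` and `flow_matching_loss`, the array shapes, and the use of random arrays as stand-ins for target-frame latents are all assumptions made for illustration.

```python
import numpy as np


def flow_matching_pair(x1, rng):
    """Build one training example for a linear-path flow-matching objective.

    x1 holds the data samples (here a hypothetical stand-in for target-frame
    latents). Returns the interpolated point x_t, the time t, and the velocity
    a decoder would be trained to predict.
    """
    x0 = rng.standard_normal(x1.shape)  # noise endpoint of the path
    # One time value per sample, broadcastable over the remaining dimensions.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    xt = (1.0 - t) * x0 + t * x1  # point on the straight path from x0 to x1
    v_target = x1 - x0  # constant velocity along that path
    return xt, t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))


rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))  # assumed shape: 4 samples, 8 latent dims
xt, t, v_target = flow_matching_pair(x1, rng)

# In a conditioned decoder, v_pred would come from a network that sees
# (xt, t, condition); here the target is reused to show the loss floor.
loss_perfect = flow_matching_loss(v_target, v_target)
```

In this generic formulation, the conditioning signal (for NExT-Vid, presumably the output of the autoregressive predictor) would enter only through the network that produces `v_pred`; the target construction itself does not depend on it.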

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Domain Adaptation and Few-Shot Learning