VideoMAR: Autoregressive Video Generation with Continuous Tokens

Hu Yu; Biao Gong; Hangjie Yuan; DanDan Zheng; Weilong Chai; Jingdong Chen; Kecheng Zheng; Feng Zhao

arXiv:2506.14168·cs.CV·June 19, 2025

TL;DR

VideoMAR is an efficient autoregressive image-to-video model that operates on continuous tokens. It combines temporally causal, frame-by-frame generation with spatially bidirectional masked generation, and it achieves state-of-the-art results on VBench-I2V with significantly fewer resources.

Contribution

The paper presents VideoMAR, a decoder-only autoregressive model for video generation with continuous tokens, together with curriculum-style training strategies and a capacity for extrapolation.

Findings

01. Surpasses the previous state of the art on the VBench-I2V benchmark.

02. Requires only 9.3% of the parameters of the previous state-of-the-art model.

03. Achieves these results with far less training data and fewer GPU resources.

Abstract

Mask-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens that combines temporal frame-by-frame generation with spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose a next-frame diffusion loss to integrate masked generation with video generation. In addition, the high cost and difficulty of long-sequence autoregressive modeling is a basic but crucial issue. To address it, we propose temporal short-to-long curriculum learning and spatial progressive-resolution training, and employ a progressive temperature strategy at inference time to mitigate the…
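The principle of temporal causality with spatial bi-directionality can be illustrated with an attention mask. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function name `frame_causal_mask`, the flat frame-major token layout, and the toy sizes are all hypothetical. It builds a boolean mask in which every token may attend to all tokens of its own frame and of earlier frames, but to no token of a later frame.

```python
def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Build a frame-level causal mask (hypothetical helper, for illustration).

    mask[q][k] is True when query token q may attend to key token k.
    Tokens are assumed to be laid out frame by frame in one flat sequence.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # Same frame: fully visible (spatially bidirectional).
        # Earlier frames: visible. Later frames: hidden (temporally causal).
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


# Toy example: 3 frames with 2 tokens each.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Each printed row is one query token. The block-lower-triangular pattern shows full attention within a frame and causal attention across frames, which differs from the strictly token-level causal mask used in standard next-token prediction.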

Code & Models

Models

🤗 inclusionAI/Ming-VideoMAR (model, ♡ 4)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization

Methods: Diffusion