MAGI-1: Autoregressive Video Generation at Scale

Sand.ai; Hansi Teng; Hongyu Jia; Lei Sun; Lingzhi Li; Maolin Li; Mingqiu Tang; Shuai Han; Tianning Zhang; W.Q. Zhang; Weifeng Luo; Xiaoyang Kang; Yuchen Sun; Yue Cao; Yunpeng Huang; Yutong Lin; Yuxin Fang; Zewei Tao; Zheng Zhang; Zhongshu Wang; Zixun Liu; Dai Shi; Guoli Su; Hanwen Sun; Hong Pan; Jie Wang; Jiexin Sheng; Min Cui; Min Hu; Ming Yan; Shucheng Yin; Siran Zhang; Tingting Liu; Xianping Yin; Xiaoyu Yang; Xin Song; Xuan Hu; Yankai Zhang; Yuqiao Li

arXiv:2505.13211 · cs.CV · May 20, 2025

TL;DR

MAGI-1 is an autoregressive video generation model that produces temporally consistent videos chunk by chunk, conditioned on text instructions. It supports real-time streaming and controllable generation, and its largest variant has 24 billion parameters.

Contribution

This work introduces MAGI-1, a large-scale autoregressive video model that predicts fixed-length chunks of frames in sequence, enabling causal, streaming, and controllable video generation at a peak inference cost that stays constant regardless of video length.

Findings

01 Achieves high temporal consistency in generated videos.

02 Supports context lengths up to 4 million tokens.

03 Maintains constant peak inference cost regardless of video length.

Abstract

We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating…
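The chunk-wise scheme described in the abstract can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not MAGI-1's implementation: the function name `stream_chunks`, the `window` parameter, and the integer "remaining denoising steps" counter are all hypothetical, and the actual denoising network is replaced by a simple decrement. It only shows how keeping later chunks noisier than earlier ones lets chunks finish in order while a bounded number of chunks is in flight, which is one plausible way peak cost could stay independent of video length.

```python
from collections import deque


def stream_chunks(total_chunks, window=4):
    """Toy pipeline: each chunk needs `window` denoising steps (hypothetical).

    Returns the order in which chunks finish and the largest number of
    chunks that were being denoised at the same time.
    """
    active = deque()  # entries are [chunk_index, remaining_steps], oldest first
    emitted = []
    peak_active = 0
    next_chunk = 0

    while next_chunk < total_chunks or active:
        # Admit one new chunk at full noise, if any remain.
        if next_chunk < total_chunks:
            active.append([next_chunk, window])
            next_chunk += 1
        peak_active = max(peak_active, len(active))

        # Noise (remaining steps) never decreases from older to newer chunks.
        levels = [remaining for _, remaining in active]
        assert levels == sorted(levels)

        # Stand-in for one joint denoising step over all active chunks.
        for entry in active:
            entry[1] -= 1

        # The oldest chunk is the first to become clean and is streamed out.
        if active and active[0][1] == 0:
            emitted.append(active.popleft()[0])

    return emitted, peak_active


order, peak = stream_chunks(10, window=4)
_, peak_long = stream_chunks(1000, window=4)
```

In this sketch, a 10-chunk run and a 1000-chunk run both peak at `window` active chunks, and chunks are emitted in temporal order, so output can be streamed as soon as the oldest chunk is clean.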

Code & Models

Repositories

Models

🤗 sand-ai/MAGI-1 · model · ♡ 604

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Human Pose and Action Recognition