Video-GPT via Next Clip Diffusion
Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, Yali Wang

TL;DR
This paper introduces Video-GPT, a model that treats video as a new language for visual world modeling. Pretrained with a next clip diffusion paradigm, it handles both short-term generation and long-term prediction and reports state-of-the-art results on video prediction.
Contribution
The paper proposes a new next clip diffusion paradigm for Video-GPT, allowing it to handle both short-term generation and long-term prediction tasks effectively.
Findings
Achieves state-of-the-art performance on video prediction benchmarks.
Demonstrates strong generalization across 6 video tasks.
Outperforms previous models on the Physics-IQ Benchmark.
Abstract
GPT has shown remarkable success in natural language processing. However, a language sequence is not sufficient to describe the spatial-temporal details of the visual world, whereas a video sequence captures such details well. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as a new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Different from previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising the noisy clip conditioned on the clean clips in the history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark:…
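The autoregressive denoising loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the learned network, the linear noise schedule and the clip shape (frames, height, width, channels) are assumptions made for illustration, and the history must contain at least one clean clip.

```python
import numpy as np

def toy_denoiser(noisy_clip, history, t):
    """Hypothetical stand-in for a learned model: guesses the clean clip by
    blending the most recent clean clip with the current noisy clip."""
    context = history[-1]
    return t * context + (1.0 - t) * noisy_clip

def generate_clips(history, num_new_clips, steps=4, seed=0):
    """Autoregressively produce new clips, each denoised from pure noise
    while conditioning on all previously available clean clips."""
    rng = np.random.default_rng(seed)
    clips = [np.asarray(c, dtype=np.float64) for c in history]
    for _ in range(num_new_clips):
        # Start the next clip from Gaussian noise (assumed prior).
        x = rng.standard_normal(clips[-1].shape)
        for i in range(steps):
            t = 1.0 - i / steps             # current noise level
            t_next = 1.0 - (i + 1) / steps  # noise level after this step
            x0_hat = toy_denoiser(x, clips, t)
            # Move toward the predicted clean clip; t_next reaches 0 at the end.
            x = t_next * x + (1.0 - t_next) * x0_hat
        # The denoised clip joins the clean history for later clips.
        clips.append(x)
    return clips

if __name__ == "__main__":
    start = [np.zeros((2, 4, 4, 3))]  # one clean clip of 2 frames, 4x4 RGB
    video = generate_clips(start, num_new_clips=3)
    print(len(video), video[-1].shape)
```

The key structural point is that only the clip currently being generated is noisy; every clip in the history is treated as clean context, which is what lets the same loop serve both short-term generation and longer rollouts.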
Peer Reviews
Decision·ICLR 2026 Poster
State-of-the-art results on multiple tasks are the main strength of this paper. The idea of interleaved clip-level noising/denoising, although intuitive, is novel IMO. Video-level self-supervision has been overlooked, IMO. Research like this can bring more attention and open the road for future work in the video domain.
Some design choices for training are not trivial to me and I need more clarification (will ask in the questions section). I believe such a method works only on single-camera and continuous (or single-scene) videos. If there is a POV change in a video, as in movies or TV shows, I believe that it will break the whole network. Motion is not modeled very well in this work. I am curious to know how this model can predict videos where there are partly stationary clips (minimal motion) and partly abrupt mot
1. The "next clip diffusion" paradigm is a creative combination of autoregressive modeling (from GPT) and diffusion (for high-quality generation). Treating clips as visual words and using historical clean clips as context for denoising is a novel adaptation of language modeling to video, filling the gap between discrete text tokens and continuous video data. This hybrid design effectively unifies short-term generation and long-term prediction. 2. As a unified video foundation model, Video-GPT br
1. Insufficient comparison with hybrid baselines: The paper mentions prior works that combine diffusion and autoregressive modeling but lacks a detailed comparison of their core differences. 2. There is a lack of comparisons with some newer autoregressive + diffusion video generation models, such as self-forcing, apt2. 3. Limited analysis of architectural choices: The model inherits Phi-3-mini’s architecture and SDXL’s VAE without justifying these choices. There is no comparison with other arch
The primary strength of this work is the impressive engineering effort demonstrated in building and evaluating a complete system. The model achieves a state-of-the-art score on the Physics-IQ benchmark, suggesting its pre-training paradigm is effective at capturing physical dynamics and motion continuity. Furthermore, the extensive fine-tuning across a wide array of both generation and understanding tasks showcases the versatility and potential of the resulting pretrained model.
Despite the strong results on specific benchmarks, this paper has significant weaknesses that undermine its contribution as a top-tier research publication. Firstly, the technical novelty is limited; the "next clip diffusion" idea is a combination of existing autoregressive and diffusion frameworks rather than a fundamental new technique. Secondly, and more critically, the evaluation feels dated and deliberately avoids direct comparison with the true state-of-the-art in video generation quality.
Taxonomy
Topics: Algorithms and Data Compression
Methods: Attention Is All You Need · Softmax · Cosine Annealing · Attention Dropout · Residual Connection · Linear Layer · Byte Pair Encoding · Weight Decay · Dropout
