Efficient Continuous Video Flow Model for Video Prediction
Gaurav Shrivastava, Abhinav Shrivastava

TL;DR
This paper introduces an efficient continuous video flow model that significantly reduces latency and computational costs in multi-step video prediction, achieving state-of-the-art results on multiple datasets.
Contribution
The paper presents a novel modeling approach that decreases sample steps and model size for improved video prediction efficiency.
Findings
Reduces prediction latency by decreasing sample steps.
Reduces model size to one-third of the original.
Achieves state-of-the-art performance on multiple datasets.
Abstract
Multi-step prediction models, such as diffusion and rectified flow models, have emerged as state-of-the-art solutions for generation tasks. However, these models exhibit higher latency in sampling new frames compared to single-step methods. This latency issue becomes a significant bottleneck when adapting such methods for video prediction tasks, given that a typical 60-second video comprises approximately 1.5K frames. In this paper, we propose a novel approach to modeling the multi-step process, aimed at alleviating latency constraints and facilitating the adaptation of such processes for video prediction tasks. Our approach not only reduces the number of sample steps required to predict the next frame but also minimizes computational demands by reducing the model size to one-third of the original size. We evaluate our method on standard video prediction datasets, including KTH, BAIR…
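To make the latency argument in the abstract concrete, here is a minimal back-of-the-envelope sketch of how the number of network evaluations grows with the number of sampling steps. It assumes one frame is predicted at a time and a frame rate of 25 fps (chosen only because it matches the abstract's figure of roughly 1.5K frames per 60 seconds). The function name and the step counts are hypothetical and are not taken from the paper.

```python
def total_network_evaluations(duration_s: float, fps: float, steps_per_frame: int) -> int:
    """Forward passes needed to predict every frame of a clip, one frame at a time."""
    num_frames = round(duration_s * fps)
    return num_frames * steps_per_frame


# Assumption: 25 fps, consistent with ~1.5K frames in a 60-second video.
DURATION_S = 60
FPS = 25

# A single-step predictor needs one forward pass per frame.
single_step_cost = total_network_evaluations(DURATION_S, FPS, steps_per_frame=1)

# Hypothetical multi-step sampler; 100 steps is an illustrative value only.
multi_step_cost = total_network_evaluations(DURATION_S, FPS, steps_per_frame=100)

print(single_step_cost, multi_step_cost)
```

Under these assumptions the cost scales linearly with the number of sampling steps, which is why reducing the step count (and the per-step cost, via a smaller model) directly targets the bottleneck the abstract describes.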
Peer Reviews
Decision·Submitted to ICLR 2025
- This work addresses an important and hard problem. - Generating in latent space appears effective for reducing latency and improving performance, which may reduce the overall complexity and run time of diffusion models for the video prediction task. - Presents detailed experimental results.
- The theoretical justifications and derivations in the method section are not easy to follow. - Please refer to the "Questions" section.
- Handles a very challenging problem of video future prediction. - New approach, different from prior works on using GANs or pixel-space diffusion. - Showed results on a variety of challenging benchmarks.
- The formulation of the solution is not technically convincing. For example, Equation 1 is written directly without any intuition, reference, or justification of why this is the optimal modeling choice. In general, it subsumes many assumptions about motion modeling in real videos and seems too restrictive to model challenging scenarios such as large motion, shot changes, occlusions, and pixel-space variations. Since the whole work rests upon this assumption, the authors are req
- The method is well motivated: careful method construction is needed to improve the efficiency of video generation methods. - The paper incorporates evaluation on multiple video datasets and seems to outperform the relevant methods. - The paper performs the generation in latent space rather than in pixel space.
- The discussion of the method itself is rather terse and hard to understand. Examples include challenges with undefined notation, an unclear concrete problem definition, and limited concreteness around the reduction to practice. This detracts from one's ability to sufficiently understand the results and contextualize them. - The relationship between classical conditional diffusion and this proposed method needs to be better explained. - Although the evaluation is rich in terms of datasets
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
