Pyramidal Flow Matching for Efficient Video Generative Modeling
Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin

TL;DR
This paper introduces a unified pyramidal flow matching algorithm for efficient high-resolution video generation, enabling end-to-end training and high-quality results with reduced computational resources.
Contribution
It proposes a novel pyramidal flow matching approach that unifies the generation process, improving efficiency and flexibility over previous cascaded architectures.
Findings
Supports high-quality 768p, 24 FPS video generation
Generates videos of up to 10 seconds, with training completed within 20.7k GPU hours
Enables end-to-end training with a single Diffusion Transformer
Abstract
Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution latent. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution…
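To make the efficiency argument concrete, the sketch below counts spatial tokens per pyramid stage. It assumes that each coarser stage halves both height and width, which the abstract above does not state, and the function names are hypothetical. It measures token counts only, not attention cost or wall-clock time.

```python
def stage_token_fractions(num_stages: int) -> list[float]:
    """Fraction of full-resolution spatial tokens at each stage, coarsest first.

    Assumption (not from the abstract): each step down the pyramid halves
    both height and width, so the token count drops by 4x per level.
    """
    return [1.0 / 4 ** (num_stages - 1 - k) for k in range(num_stages)]


def average_token_fraction(num_stages: int) -> float:
    """Mean token fraction if every stage ran the same number of steps."""
    fractions = stage_token_fractions(num_stages)
    return sum(fractions) / len(fractions)


fractions = stage_token_fractions(3)
print(fractions)                  # coarsest stage first, full resolution last
print(average_token_fraction(3))  # mean over the three stages
```

Under these assumptions, only the last of three stages processes the full token count, and the earlier stages process a quarter and a sixteenth of it.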
Peer Reviews
Decision·ICLR 2025 Poster
+ The proposed pyramidal flow matching scheme is novel in video generation modeling and greatly enhances training efficiency.
+ The unified training objective is intuitive and effective.
+ The quantitative and qualitative analyses in the paper are comprehensive.
+ The quality of the generated videos is excellent.
* The writing in the paper could benefit from further improvement for clarity and readability.
  - The repeated use of the term "full-resolution" up to the experiment section suggests that generation is being done in pixel space rather than latent space. It would be helpful to clarify this in the paper, as it may be misleading.
  - The paper contains several grammatical errors and repeatedly uses unnecessary terms, such as "sophisticated," which affect readability. I encourage the authors t
The strength of this paper is multi-fold.
+ It builds a flow matching model with multiple resolutions for text-to-video generation. The pyramidal flow matching allows the model to train with less computational cost and a smaller memory footprint.
+ The whole model has a unified objective instead of optimizing separate modules for video generation and super-resolution, using a single Diffusion Transformer.
+ The experimental results are competitive, with evaluation on two public benchmarks of VBench and
While this work has an interesting novel design for flow matching for video generation and competitive visual results, there are some unclear points and weaknesses as follows.
- Questions about [s_k, e_k]. The authors divide [0,1] into K time windows [s_k, e_k]. Why don't the authors set e_{k+1} = s_k? Instead, the authors use e_{k+1} = 2s_k/(1+s_k), and we can see that e_{k+1} > s_k. This means there are overlapping time windows. Given a time step t, t may fall on more than one time step, and how do author
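The overlap raised in this review follows directly from the quoted formula: e_{k+1} - s_k = s_k(1 - s_k)/(1 + s_k), which is positive whenever 0 < s_k < 1. The short sketch below checks this numerically. The function name and the sample values of s_k are hypothetical and chosen only for illustration. They are not taken from the paper.

```python
def next_window_end(s_k: float) -> float:
    """End point e_{k+1} as quoted in the review: 2 * s_k / (1 + s_k)."""
    return 2.0 * s_k / (1.0 + s_k)


# Illustrative start points only; the paper's actual s_k values are not given here.
for s_k in (0.25, 0.5, 0.75):
    e_next = next_window_end(s_k)
    # A positive difference means window k+1 extends past the start of window k.
    print(f"s_k={s_k:.2f}  e_(k+1)={e_next:.4f}  overlap={e_next - s_k:.4f}")
```

The difference shrinks to zero at s_k = 0 and s_k = 1, so under this formula the windows overlap only for start points strictly inside the unit interval.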
- The ideas of both spatial pyramids and temporal pyramids are novel and interesting.
- The training efficiency is largely improved due to the novel pyramid design.
- The analysis of inference efficiency is lacking. How does the proposed method compare to previous full-attention methods for different numbers of frames?
- Compared to full-attention methods, the proposed autoregressive method may encounter the issue of error drifting when the number of frames increases. At how many frames will the proposed method fail?
- It would be good to include some video results from previous methods on the project page.
- Figure 7 and Figure 8 both show partial results
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Human Pose and Action Recognition
Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
