Pyramidal Flow Matching for Efficient Video Generative Modeling

Yang Jin; Zhicheng Sun; Ningyuan Li; Kun Xu; Kun Xu; Hao Jiang; Nan Zhuang; Quzhe Huang; Yang Song; Yadong Mu; Zhouchen Lin

arXiv:2410.05954·cs.CV·March 18, 2025·2 cites

TL;DR

This paper introduces a unified pyramidal flow matching algorithm for efficient high-resolution video generation, enabling end-to-end training and high-quality results with reduced computational resources.

Contribution

It proposes a novel pyramidal flow matching approach that unifies the generation process, improving efficiency and flexibility over previous cascaded architectures.

Findings

01 Supports high-quality video generation at 768p resolution and 24 FPS

02 Generates videos up to 10 seconds long with a model trained within 20.7k A100 GPU hours

03 Enables end-to-end training with a single Diffusion Transformer

Abstract

Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution latent. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution…
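The efficiency argument in the abstract (only the final stage runs at full resolution) can be illustrated with simple token arithmetic. The sketch below is not the authors' implementation: the 2x spatial downsampling between stages, the 96x160 latent size, the three stages, and the equal number of steps per stage are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def stage_shapes(height, width, num_stages):
    """Latent (height, width) per stage, coarsest first.

    Assumes each earlier stage halves both spatial dimensions (hypothetical).
    """
    shapes = []
    for k in range(num_stages - 1, -1, -1):
        factor = 2 ** k
        shapes.append((height // factor, width // factor))
    return shapes


def relative_token_cost(height, width, num_stages):
    """Tokens processed across all stages, relative to running every stage
    at full resolution. Assumes an equal number of steps per stage."""
    full = height * width
    tokens = [h * w for h, w in stage_shapes(height, width, num_stages)]
    return sum(tokens) / (num_stages * full)


shapes = stage_shapes(96, 160, 3)        # [(24, 40), (48, 80), (96, 160)]
cost = relative_token_cost(96, 160, 3)   # 0.4375
print(shapes, cost)
```

Under these assumptions the pyramid processes about 44% of the tokens of a full-resolution trajectory. This counts tokens only; attention cost grows faster than linearly in token count, so the actual saving may differ.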

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

+ The proposed pyramidal flow matching scheme is novel in video generation modeling and greatly enhances training efficiency.
+ The unified training objective is intuitive and effective.
+ The quantitative and qualitative analyses in the paper are comprehensive.
+ The quality of the generated videos is excellent.

Weaknesses

* The writing in the paper could benefit from further improvement for clarity and readability.
  - The repeated use of the term "full-resolution" up to the experiment section suggests that generation is being done in pixel space rather than latent space. It would be helpful to clarify this in the paper, as it may be misleading.
  - The paper contains several grammatical errors and repeatedly uses unnecessary terms, such as "sophisticated," which affect readability. I encourage the authors t

Reviewer 02 · Rating 8 · Confidence 5

Strengths

The strengths of this paper are multi-fold.
+ It builds a flow matching model with multiple resolutions for text-to-video generation. The pyramidal flow matching allows the model to train with lower computational cost and a smaller memory footprint.
+ The whole model has a unified objective instead of optimizing separate modules for video generation and super-resolution, using a single Diffusion Transformer.
+ The experimental results are competitive, with evaluation on two public benchmarks of VBench and

Weaknesses

While this work has an interesting novel design for flow matching for video generation and competitive visual results, there are some unclear points and weaknesses as follows.
- Questions about [s_k, e_k]. The authors divide [0,1] into K time windows [s_k, e_k]. Why don't the authors set e_{k+1} = s_k? Instead, the authors use e_{k+1} = 2s_k/(1+s_k), and we can see that e_{k+1} > s_k. This means there are overlapping time windows. Given a time step t, t may fall into more than one time window, and how do author
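The overlap the reviewer describes follows directly from the quoted relation: e_{k+1} - s_k = 2s_k/(1+s_k) - s_k = s_k(1-s_k)/(1+s_k), which is positive whenever 0 < s_k < 1. A small check with exact fractions, as a sketch only: the sample values of s_k are hypothetical and not taken from the paper, and the function names are illustrative.

```python
from fractions import Fraction


def next_window_end(s_k):
    """Relation quoted in the review: e_{k+1} = 2 * s_k / (1 + s_k)."""
    return 2 * s_k / (1 + s_k)


def overlap(s_k):
    """Amount by which e_{k+1} exceeds s_k.

    Algebraically equal to s_k * (1 - s_k) / (1 + s_k).
    """
    return next_window_end(s_k) - s_k


# Hypothetical window start points strictly inside (0, 1).
for s_k in (Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)):
    print(f"s_k = {s_k}, e_(k+1) = {next_window_end(s_k)}, overlap = {overlap(s_k)}")
```

For example, s_k = 1/3 gives e_{k+1} = 1/2, so the two windows share an interval of length 1/6. The overlap vanishes only at the endpoints s_k = 0 and s_k = 1.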

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- The ideas of both spatial pyramids and temporal pyramids are novel and interesting.
- The training efficiency is largely improved due to the novel pyramid design.

Weaknesses

- The analysis of inference efficiency is lacking. How does the proposed method compare to previous full-attention methods for different numbers of frames?
- Compared to full-attention methods, the proposed autoregressive method may suffer from error drift as the number of frames increases. At how many frames will the proposed method fail?
- It would be good to include some video results from previous methods on the project page.
- Figure 7 and Figure 8 both show partial results

Code & Models

Repositories

jy0205/Pyramid-Flow · PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Human Pose and Action Recognition

Methods: Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings