Hierarchical Patch Diffusion Models for High-Resolution Video Generation
Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov

TL;DR
This paper introduces hierarchical patch diffusion models that improve high-resolution video generation by enforcing patch consistency and adaptive computation, achieving state-of-the-art results and enabling efficient end-to-end training on ultra-high resolutions.
Contribution
The paper presents a novel hierarchical patch diffusion architecture with deep context fusion and adaptive computation, significantly enhancing high-resolution video synthesis capabilities.
Findings
Achieved a new state-of-the-art FVD score of 66.32 on UCF-101.
Surpassed recent methods by more than 100% in Inception Score.
First diffusion-based model trained end-to-end on ultra-high-resolution videos.
Abstract
Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. Patch diffusion models (PDMs) instead model patches rather than whole inputs, which makes them very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion -- an architectural technique that propagates the context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and…
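The abstract describes patches arranged hierarchically from low scale to high scale but does not give the sampling procedure. As a rough illustration only, the NumPy sketch below builds one possible coarse-to-fine patch pyramid: every level has the same patch resolution, level 0 covers the whole frame, and each later level covers a randomly placed region with half the side length, nested inside the previous one. The function name `sample_patch_pyramid`, its parameters, the halving factor, and the average-pool downsampling are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def sample_patch_pyramid(video, patch_size, num_levels, rng):
    """Return a coarse-to-fine list of (patch, box) pairs (hypothetical helper).

    video: array of shape (T, H, W, C). H and W must be divisible by
    patch_size * 2 ** (num_levels - 1) so every level pools evenly.
    box is (top, left, height, width) in full-resolution pixel coordinates.
    """
    T, H, W, C = video.shape
    step = patch_size * 2 ** (num_levels - 1)
    if H % step or W % step:
        raise ValueError("H and W must be divisible by patch_size * 2**(num_levels-1)")

    top, left, size_h, size_w = 0, 0, H, W
    pyramid = []
    for level in range(num_levels):
        if level > 0:
            # Halve the covered region and place it at a random offset
            # inside the region used by the previous (coarser) level.
            size_h //= 2
            size_w //= 2
            top += int(rng.integers(0, size_h + 1))
            left += int(rng.integers(0, size_w + 1))
        region = video[:, top:top + size_h, left:left + size_w]
        # Average-pool the region down to patch_size x patch_size
        # (assumed downsampling; the source does not specify one).
        fh, fw = size_h // patch_size, size_w // patch_size
        patch = region.reshape(T, patch_size, fh, patch_size, fw, C).mean(axis=(2, 4))
        pyramid.append((patch, (top, left, size_h, size_w)))
    return pyramid


rng = np.random.default_rng(0)
video = rng.random((4, 64, 64, 3))
for patch, box in sample_patch_pyramid(video, patch_size=16, num_levels=3, rng=rng):
    print(patch.shape, box)
```

In this sketch the coarser patches are what a context-fusion mechanism would draw on when denoising a finer patch, since each finer box lies inside the coarser one.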
Taxonomy
Topics: Video Coding and Compression Technologies · Computer Graphics and Visualization Techniques · Image and Video Quality Assessment
Methods: Balanced Selection · Diffusion
