UltraGen: High-Resolution Video Generation with Hierarchical Attention

Teng Hu; Jiangning Zhang; Zihan Su; Ran Yi

arXiv:2510.18775·cs.CV·October 22, 2025

TL;DR

UltraGen introduces a hierarchical attention framework that enables efficient, high-resolution video generation up to 4K, overcoming previous computational limitations of diffusion transformer models.

Contribution

It presents a novel hierarchical dual-branch attention architecture with global-local decomposition and spatial compression for scalable high-resolution video synthesis.

Findings

01. Successfully scales models to 1080P and 4K resolutions.

02. Outperforms existing state-of-the-art methods in quality and efficiency.

03. Enables end-to-end native high-resolution video generation.

Abstract

Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion-transformer-based video generation models are limited to low-resolution outputs (≤720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a…
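
To make the quadratic bottleneck mentioned in the abstract concrete, the following is a rough arithmetic sketch, not the paper's own analysis. It assumes each token covers an 8x8 pixel block (a hypothetical downsampling factor chosen for illustration) and counts the token pairs that full self-attention would compare at each resolution. The function names and the `RESOLUTIONS` table are illustrative, not taken from the paper.

```python
# Illustrative arithmetic only. The downsampling factor is an assumption,
# not a value taken from the UltraGen paper.

RESOLUTIONS = {
    "720P": (1280, 720),
    "1080P": (1920, 1080),
    "4K": (3840, 2160),
}


def tokens_per_frame(width: int, height: int, downsample: int = 8) -> int:
    """Spatial tokens per frame, assuming one token per
    downsample x downsample pixel block (hypothetical factor)."""
    return (width // downsample) * (height // downsample)


def attention_pairs(num_tokens: int) -> int:
    """Full self-attention compares every token with every token."""
    return num_tokens * num_tokens


def relative_cost(name: str, baseline: str = "720P") -> float:
    """Attention pair count at a resolution, relative to the baseline."""
    n = tokens_per_frame(*RESOLUTIONS[name])
    n0 = tokens_per_frame(*RESOLUTIONS[baseline])
    return attention_pairs(n) / attention_pairs(n0)


if __name__ == "__main__":
    for name in RESOLUTIONS:
        print(name, tokens_per_frame(*RESOLUTIONS[name]), relative_cost(name))
```

Under these assumptions, 4K has 9 times as many tokens per frame as 720P, so full attention compares 81 times as many token pairs. For a fixed number of frames the same ratio holds for attention over the whole clip, since the frame count multiplies the token count equally at every resolution.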

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Image and Video Quality Assessment