AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling
Ziyang Mai, Yuyao Zhang, Yu-Wing Tai

TL;DR
AtlasVid introduces a decoupled global-local framework that significantly enhances the efficiency of ultra-high-resolution long video generation, enabling high-quality outputs with reduced training costs and increased speed.
Contribution
The paper proposes a novel decoupled global-local modeling approach that allows resolution-agnostic training and efficient high-resolution long video synthesis.
Findings
Achieves a 60.9x speedup in UHR long video generation.
Outperforms native 4K video generators in quality and efficiency.
Enables training at 720P with generalization to 4K and beyond.
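To give a rough sense of why training at 720P rather than 4K matters, the sketch below compares per-frame pixel counts for the two standard resolutions. The frame sizes are the usual 1280x720 and 3840x2160 definitions. The quadratic attention estimate is an assumption made here for illustration (full self-attention over tokens proportional to pixels); it is not a description of AtlasVid's architecture and is not meant to account for the reported 60.9x figure.

```python
# Standard frame sizes (width, height) in pixels.
RES_720P = (1280, 720)
RES_4K = (3840, 2160)


def pixels(res):
    """Return the number of pixels in one frame of the given size."""
    width, height = res
    return width * height


# How many times more pixels a 4K frame has than a 720P frame.
pixel_ratio = pixels(RES_4K) / pixels(RES_720P)

# Assumption (not from the paper): token count scales with pixel count and
# full self-attention cost scales with the square of the token count.
attention_ratio = pixel_ratio ** 2

print(f"pixels per frame, 4K vs 720P: {pixel_ratio:.0f}x")
print(f"hypothetical full-attention cost ratio: {attention_ratio:.0f}x")
```

Under these assumptions, each 4K frame carries 9 times the pixels of a 720P frame, so any component whose cost grows faster than linearly with resolution becomes much more expensive at native 4K.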
Abstract
Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation, where a continuous scene must preserve both global temporal coherence and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlasVid, a decoupled global-local framework for efficient UHR long video generation. AtlasVid first generates a low-resolution…
