Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation
Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, Xuelong Li

TL;DR
This paper introduces a hierarchical planning framework for long video generation that improves quality, consistency, and efficiency by combining macro and micro planning stages with parallelized content generation.
Contribution
It proposes a novel Macro-from-Micro Planning framework that enables long, high-quality, and parallelized autoregressive video synthesis, addressing temporal drift and scalability issues.
Findings
Outperforms existing models in video quality and stability
Enables parallelized generation of long videos
Achieves efficient long-term consistency across segments
Abstract
Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframe planning across the entire video through an autoregressive chain of micro…
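The plan-then-populate flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: frames are stood in for by integer indices, and the names `micro_plan`, `macro_plan`, `populate`, and `generate`, the even keyframe spacing, and the choice to seed each segment with the previous segment's last keyframe are all assumptions made for illustration. It only shows the control flow: a short sequential chain over segments produces the keyframe plan, after which segments can be filled in independently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def micro_plan(first_frame: int, segment_len: int, num_keyframes: int) -> List[int]:
    """Hypothetical Micro Planning: pick sparse keyframe indices for one
    segment, all predicted jointly from the segment's first frame."""
    step = segment_len // num_keyframes
    return [first_frame + step * (k + 1) for k in range(num_keyframes)]


def macro_plan(num_segments: int, segment_len: int, num_keyframes: int) -> List[List[int]]:
    """Hypothetical Macro Planning: chain micro plans autoregressively.
    Assumption: the last keyframe of one segment seeds the next segment."""
    plans: List[List[int]] = []
    first = 0
    for _ in range(num_segments):
        keys = micro_plan(first, segment_len, num_keyframes)
        plans.append(keys)
        first = keys[-1]
    return plans


def populate(segment_start: int, keys: List[int]) -> List[int]:
    """Toy content population: list every frame index from the segment start
    to its last keyframe. A real model would synthesize frames conditioned
    on the keyframes instead."""
    return list(range(segment_start, keys[-1] + 1))


def generate(num_segments: int, segment_len: int, num_keyframes: int) -> Tuple[List[List[int]], List[List[int]]]:
    plans = macro_plan(num_segments, segment_len, num_keyframes)
    # Each segment starts at frame 0 or at the previous segment's last keyframe.
    starts = [0] + [p[-1] for p in plans[:-1]]
    # Once the plan is fixed, segments do not depend on each other,
    # so they can be populated concurrently.
    with ThreadPoolExecutor() as pool:
        segments = list(pool.map(populate, starts, plans))
    return plans, segments


if __name__ == "__main__":
    plans, segments = generate(num_segments=3, segment_len=8, num_keyframes=2)
    print(plans)  # [[4, 8], [12, 16], [20, 24]]
    print([len(s) for s in segments])
```

In this toy setup the only sequential dependency is the three-step chain in `macro_plan`; the dense frames inside each segment are produced in parallel, and adjacent segments share their boundary frame.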
Peer Reviews
Decision · Submitted to ICLR 2026
- MMPL proposes a hierarchical autoregressive planning pipeline for long-video generation.
- MMPL can synthesize frames for multiple video segments in parallel, guided by pre-planned keyframes.
- MMPL incorporates a workload scheduling strategy to minimize the overhead of the proposed pipeline.
- **Decoupled planning and generation pipelines.** MMPL introduces a typical divide-and-conquer strategy to schedule the long-video generation task. I consider that a key challenge lies in the fidelity of motion modeling during the planning stage, when complete frames are unavailable (i.e., MMPL does not appear to be trained end to end). The authors should report a dynamic degree metric and compare both subject and camera motion against standard full or causal attention baselines.
- **Incomplete
- This paper tackles an important and practical problem, generating long, high-quality videos with autoregressive models, whose limitations in temporal drift and sequential bottlenecks are well analyzed.
- The authors propose an elegant two-level planning framework that decouples long-range dependency modeling from dense frame generation, achieving both consistency and parallelism without architectural surgery.
- Experiments are good, combining automatic metrics, human evaluation, ablations, and
My major concern about this paper lies in the rather limited performance improvement demonstrated by the proposed technique. In other words, the claimed state-of-the-art results may largely stem from an unfair comparison. Specifically, the method is trained on an exceptionally strong baseline model—Wan-2.1—yet Table 1 does not include any comparison with Wan-2.1 itself. One possible reason for this omission might be that Wan-2.1 cannot generate videos as long as those presented in this work. How
1. Making long-video AR generation as plan-then-populate with keyframes per segment, then parallel population conditioned on those anchors, is a clean, modular perspective. While planning and hierarchical generation exist, the specific coupling of joint keyframe prediction from the segment’s first frame plus segment-level AR chaining is crisply articulated and differs from step-jump or packed-context approaches.
2. The two-mode scheduler (minimum memory peak vs. maximum throughput) provides an e
1. Questionable novelty boundary vs. prior “planning/parallel AR” lines. The closest prior they cite, FramePack-Plan, already reduces error via step-wise frame jumping and context compression; other works use hierarchical/story planning or parallelized AR decoding. The paper states three innovations (two-level plan, one-pass segment keyframe prediction, then parallel population), but the empirical study does not isolate what is fundamentally new versus combinations/engineering of known technique
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Visual Attention and Saliency Detection
