Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation

Xunzhi Xiang; Yabo Chen; Guiyu Zhang; Zhongyu Wang; Zhe Gao; Quanming Xiang; Gonghu Shang; Junqi Liu; Haibin Huang; Yang Gao; Chi Zhang; Qi Fan; Xuelong Li

arXiv:2508.03334·cs.CV·October 15, 2025

Open Access · 1 Model · 3 Reviews

TL;DR

This paper introduces a hierarchical planning framework for long video generation that improves quality, consistency, and efficiency by combining macro and micro planning stages with parallelized content generation.

Contribution

It proposes a novel Macro-from-Micro Planning framework that enables long, high-quality, and parallelized autoregressive video synthesis, addressing temporal drift and scalability issues.

Findings

01. Outperforms existing models in video quality and stability

02. Enables parallelized generation of long videos

03. Achieves efficient long-term consistency across segments

Abstract

Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that autoregressive modeling typically suffers from temporal drift caused by error accumulation, and that it hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframe planning across the entire video through an autoregressive chain of micro…
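The plan-then-populate idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: integers stand in for frames, and the names `micro_plan`, `macro_plan`, `populate`, and `generate`, the keyframe `offsets`, and the choice to seed each segment from the previous segment's last keyframe are all assumptions made for illustration. In the paper these steps are carried out by learned video models, not arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def micro_plan(first_frame, offsets):
    """Toy Micro Planning: predict sparse keyframes for one segment.

    Hypothetical stand-in for a model call: each keyframe is simply the
    segment's first frame plus a fixed offset.
    """
    return [first_frame + o for o in offsets]


def macro_plan(initial_frame, num_segments, offsets):
    """Toy Macro Planning: chain micro plans across the whole video.

    Assumption: the last keyframe of one segment seeds the next segment.
    """
    plans, start = [], initial_frame
    for _ in range(num_segments):
        keyframes = micro_plan(start, offsets)
        plans.append((start, keyframes))
        start = keyframes[-1]
    return plans


def populate(plan):
    """Toy content population: fill frames between consecutive anchors."""
    start, keyframes = plan
    anchors = [start] + keyframes
    frames = []
    for a, b in zip(anchors, anchors[1:]):
        frames.extend(range(a, b))  # frames from a up to, not including, b
    return frames


def generate(initial_frame=0, num_segments=3, offsets=(2, 4, 8)):
    """Plan the whole video first, then populate segments in parallel."""
    plans = macro_plan(initial_frame, num_segments, offsets)
    # Segments depend only on their own planned anchors, so they can be
    # filled independently; pool.map preserves segment order.
    with ThreadPoolExecutor() as pool:
        segments = list(pool.map(populate, plans))
    video = [frame for seg in segments for frame in seg]
    video.append(plans[-1][1][-1])  # close the video with the final keyframe
    return plans, video


if __name__ == "__main__":
    plans, video = generate()
    print(plans)
    print(len(video))
```

The point of the sketch is the dependency structure: only the sparse planning loop is sequential, while the dense per-segment work has no cross-segment dependency once the anchors are fixed.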

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- MMPL proposes a hierarchical autoregressive planning pipeline for long-video generation.
- MMPL can synthesize frames for multiple video segments in parallel, guided by pre-planned keyframes.
- MMPL incorporates a workload scheduling strategy to minimize the overhead of the proposed pipeline.

Weaknesses

- **Decoupled planning and generation pipelines.** MMPL introduces a typical divide-and-conquer strategy to schedule the long-video generation task. In my view, a key challenge lies in the fidelity of motion modeling during the planning stage, when complete frames are unavailable (i.e., MMPL does not appear to be an end-to-end training method). The authors should report a dynamic degree metric and compare both subject and camera motion against standard full or causal attention baselines.
- **Incomplete

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- This paper tackles an important and practical problem, generating long, high-quality videos with autoregressive models, whose limitations in temporal drift and sequential bottlenecks are well analyzed.
- The authors propose an elegant two-level planning framework that decouples long-range dependency modeling from dense frame generation, achieving both consistency and parallelism without architectural surgery.
- Experiments are good, combining automatic metrics, human evaluation, ablations, and

Weaknesses

My major concern about this paper lies in the rather limited performance improvement demonstrated by the proposed technique. In other words, the claimed state-of-the-art results may largely stem from an unfair comparison. Specifically, the method is trained on an exceptionally strong baseline model—Wan-2.1—yet Table 1 does not include any comparison with Wan-2.1 itself. One possible reason for this omission might be that Wan-2.1 cannot generate videos as long as those presented in this work. How

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Framing long-video AR generation as plan-then-populate, with keyframes planned per segment and parallel population conditioned on those anchors, is a clean, modular perspective. While planning and hierarchical generation exist, the specific coupling of joint keyframe prediction from the segment’s first frame plus segment-level AR chaining is crisply articulated and differs from step-jump or packed-context approaches.
2. The two-mode scheduler (minimum memory peak vs. maximum throughput) provides an e

Weaknesses

1. Questionable novelty boundary vs. prior “planning/parallel AR” lines. The closest prior they cite, FramePack-Plan, already reduces error via step-wise frame jumping and context compression; other works use hierarchical/story planning or parallelized AR decoding. The paper states three innovations (two-level plan, one-pass segment keyframe prediction, then parallel population), but the empirical study does not isolate what is fundamentally new versus combinations/engineering of known technique

Code & Models

Models

🤗 Tele-AI/MMPL (model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Visual Attention and Saliency Detection