GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

Zihao Lin; Haibo Wang; Zhiyang Xu; Siyao Dai; Huanjie Dong; Xiaohan Wang; Yolo Y. Tang; Yixin Wang; Qifan Wang; Lifu Huang

arXiv:2604.05076·cs.MA·April 8, 2026

TL;DR

GLANCE is a multi-agent framework for music-grounded non-linear video editing that combines global planning with local refinement and outperforms existing methods on a new benchmark.

Contribution

The paper introduces GLANCE, a global-local coordination multi-agent system with a bi-loop architecture, together with a new benchmark for music-grounded video editing.

Findings

01. GLANCE outperforms prior baselines by 33.2% and 15.6% on two tasks.

02. The framework effectively manages cross-segment conflicts and long-range constraints.

03. Human evaluation confirms the high quality of generated videos.

Abstract

Music-grounded mashup video creation is a challenging form of non-linear video editing, in which a system must compose a coherent timeline from large collections of source videos while aligning with music rhythm, user intent, story completeness, and long-range structural constraints. Existing approaches typically rely on fixed pipelines or simplified retrieval-and-concatenation paradigms, limiting their ability to adapt to diverse prompts and heterogeneous source materials. In this paper, we present GLANCE, a global-local coordination multi-agent framework for music-grounded non-linear video editing. GLANCE adopts a bi-loop architecture for better editing practice: an outer loop performs long-horizon planning and task-graph construction, and an inner loop adopts the "Observe-Think-Act-Verify" flow for segment-wise editing tasks and their refinements. To address the cross-segment and global…
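The bi-loop control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`SegmentTask`, `plan_task_graph`, `edit_segment`, `run_bi_loop`, the `propose` and `verify` callbacks, and `max_rounds`) is hypothetical, the task graph is assumed to be a simple chain of segments, and agent calls are replaced by plain functions supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SegmentTask:
    """One node of a hypothetical task graph: a single music segment to fill."""
    name: str
    depends_on: List[str] = field(default_factory=list)


def plan_task_graph(segment_names: List[str]) -> List[SegmentTask]:
    """Outer-loop planning stand-in (assumption): chain the segments so that
    each one depends on the segment before it."""
    tasks = []
    for i, name in enumerate(segment_names):
        deps = [segment_names[i - 1]] if i > 0 else []
        tasks.append(SegmentTask(name=name, depends_on=deps))
    return tasks


def edit_segment(
    task: SegmentTask,
    propose: Callable[[dict], str],
    verify: Callable[[SegmentTask, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> str:
    """Inner loop: Observe -> Think -> Act -> Verify, retried up to max_rounds."""
    feedback = ""
    clip = ""
    for attempt in range(1, max_rounds + 1):
        # Observe: gather the task and any feedback from the previous round.
        observation = {"task": task.name, "feedback": feedback, "attempt": attempt}
        # Think + Act: a placeholder for an agent choosing and placing a clip.
        clip = propose(observation)
        # Verify: accept the edit or return feedback for refinement.
        ok, feedback = verify(task, clip)
        if ok:
            break
    return clip  # last attempt is returned even if it never verified


def run_bi_loop(
    segment_names: List[str],
    propose: Callable[[dict], str],
    verify: Callable[[SegmentTask, str], Tuple[bool, str]],
) -> Dict[str, str]:
    """Outer loop: build the task graph, then edit segments in dependency order."""
    timeline: Dict[str, str] = {}
    for task in plan_task_graph(segment_names):
        missing = [dep for dep in task.depends_on if dep not in timeline]
        if missing:
            raise ValueError(f"{task.name} scheduled before {missing}")
        timeline[task.name] = edit_segment(task, propose, verify)
    return timeline
```

In this sketch the verifier's feedback is the only channel between rounds of the inner loop, which keeps each segment's refinement local while the outer loop owns ordering. How the actual framework handles cross-segment and global constraints is not covered here, since the abstract is truncated at that point.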
