GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu

TL;DR
GenMAC introduces a multi-agent, iterative framework for compositional text-to-video generation, effectively handling complex scenes by decomposing tasks and collaboratively refining outputs, leading to state-of-the-art results.
Contribution
This work presents a novel multi-agent, iterative approach with specialized agents and self-routing to improve compositional text-to-video generation.
Findings
Achieves state-of-the-art performance on compositional text-to-video benchmarks.
Effectively decomposes complex tasks into simpler, role-specific agent collaborations.
Demonstrates significant improvements over existing models in handling complex scene composition.
Abstract
Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most…
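The Design, Generation, and Redesign workflow described in the abstract can be pictured as a simple control loop. The following is only a minimal sketch of that control flow, not the authors' implementation: the function names (`run_loop`, `toy_design`, `toy_generate`, `toy_redesign`), the string-based "plans" and "videos", the `max_iters` cap, and the stopping rule are all hypothetical stand-ins for the MLLM agents and the video generator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LoopResult:
    video: str                 # last generated video (a string stand-in here)
    design: str                # design that produced the final video
    iterations: int            # number of Generation/Redesign rounds executed
    history: List[str] = field(default_factory=list)


def run_loop(
    prompt: str,
    design_fn: Callable[[str], str],
    generate_fn: Callable[[str], str],
    redesign_fn: Callable[[str, str, str], Tuple[bool, str]],
    max_iters: int = 3,        # assumption: the abstract gives no iteration cap
) -> LoopResult:
    """Design once, then alternate Generation and Redesign until accepted."""
    design = design_fn(prompt)                      # Design stage
    history: List[str] = []
    video = ""
    for i in range(1, max_iters + 1):
        video = generate_fn(design)                 # Generation stage
        history.append(video)
        accepted, revised = redesign_fn(prompt, design, video)  # Redesign stage
        if accepted:
            return LoopResult(video, design, i, history)
        design = revised                            # refine and try again
    return LoopResult(video, design, max_iters, history)


# Hypothetical toy agents so the sketch runs without any model.
def toy_design(prompt: str) -> str:
    return f"plan#0: {prompt}"


def toy_generate(design: str) -> str:
    return f"video({design})"


def toy_redesign(prompt: str, design: str, video: str) -> Tuple[bool, str]:
    # Toy verification rule: accept once the plan has been revised at least once.
    revision = int(design.split(":")[0].split("#")[1])
    if revision >= 1:
        return True, design
    return False, f"plan#{revision + 1}: {prompt}"


result = run_loop(
    "a red ball rolls left of a blue cube",
    toy_design, toy_generate, toy_redesign,
)
print(result.iterations, result.video)
```

In this toy run the first video is rejected, the plan is revised once, and the second video is accepted. In a real system each callable would wrap a model call, and the Redesign step would inspect the generated video against the prompt rather than a revision counter.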
Taxonomy
Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Topic Modeling
