GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Kaiyi Huang; Yukun Huang; Xuefei Ning; Zinan Lin; Yu Wang; Xihui Liu

arXiv:2412.04440·cs.CV·December 6, 2024

TL;DR

GenMAC is an iterative, multi-agent framework for compositional text-to-video generation. It decomposes the generation of complex scenes into simpler sub-tasks, assigns each to a role-specialized agent, and refines the output collaboratively; the authors report state-of-the-art results.

Contribution

This work presents a novel multi-agent, iterative approach with specialized agents and self-routing to improve compositional text-to-video generation.

Findings

01

Achieves state-of-the-art performance on compositional text-to-video benchmarks.

02

Effectively decomposes complex tasks into simpler, role-specific agent collaborations.

03

Demonstrates significant improvements over existing models in handling complex scene composition.

Abstract

Text-to-video generation models have shown significant progress in recent years. However, they still struggle to generate complex dynamic scenes from compositional text prompts, which involve attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most…
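The control flow of the three stages named in the abstract (Design once, then a Generation and Redesign loop) can be sketched roughly as follows. This is only an illustrative skeleton and not the authors' implementation: the names `run_workflow`, `RedesignResult`, `max_iters`, and the three stub functions are hypothetical, and the stubs stand in for the MLLM agents and the video generator, whose actual interfaces are not described in this excerpt.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RedesignResult:
    """Hypothetical output of the Redesign stage."""
    satisfied: bool       # True if the video is judged to match the prompt
    revised_design: str   # design to use for the next Generation pass


def run_workflow(
    prompt: str,
    design: Callable[[str], str],
    generate: Callable[[str], str],
    redesign: Callable[[str, str, str], RedesignResult],
    max_iters: int = 3,
) -> Tuple[str, List[str]]:
    """Design once, then alternate Generation and Redesign.

    Returns the last generated video and the list of all attempts.
    The stopping rule (a `satisfied` flag or an iteration cap) is an assumption.
    """
    plan = design(prompt)                       # Design stage
    history: List[str] = []
    video = ""
    for _ in range(max_iters):
        video = generate(plan)                  # Generation stage
        history.append(video)
        result = redesign(prompt, plan, video)  # Redesign stage: verify and refine
        if result.satisfied:
            break
        plan = result.revised_design
    return video, history


# Stubs for illustration only; real stages would call models.
def stub_design(prompt: str) -> str:
    return f"plan[{prompt}]#0"


def stub_generate(plan: str) -> str:
    return f"video({plan})"


def stub_redesign(prompt: str, plan: str, video: str) -> RedesignResult:
    base, revision = plan.rsplit("#", 1)
    if int(revision) >= 2:                      # pretend the third attempt passes
        return RedesignResult(True, plan)
    return RedesignResult(False, f"{base}#{int(revision) + 1}")


final_video, attempts = run_workflow(
    "a red ball bounces past a blue cube",
    stub_design, stub_generate, stub_redesign, max_iters=5,
)
```

With these stubs the loop makes three Generation passes before the Redesign stub accepts the result. Capping the loop with `max_iters` is a design choice of this sketch, so that the workflow terminates even if verification never passes.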

Taxonomy

Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Topic Modeling