Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

Liu He; Yizhi Song; Hejun Huang; Pinxin Liu; Yunlong Tang; Daniel Aliaga; Xin Zhou

arXiv:2408.10453 · cs.CV · May 6, 2025 · 2 cites

Open Access

TL;DR

This paper presents an automated pipeline for synthetic video generation in which collaborating Vision Large Language Model (VLM) agents create and refine Blender scripts from a textual description, yielding videos that score higher on quality than those of commercial video generation models.

Contribution

Introduces a novel multi-agent VLM-based framework for automatic, text-driven synthetic video creation that improves quality and consistency over existing models.

Findings

1. Generated videos outperform commercial models in quality metrics.
2. Framework achieves higher scores in user studies for quality and rationality.
3. Collaborative agent approach enhances physical realism and temporal consistency.

Abstract

Text-to-video generation has been dominated by diffusion-based or autoregressive models. These novel models provide plausible versatility, but are criticized for improper physical motion, shading and illumination, camera motion, and temporal consistency. The film industry relies on manually-edited Computer-Generated Imagery (CGI) using 3D modeling software. Human-directed 3D synthetic videos address these shortcomings, but require tight collaboration between movie makers and 3D rendering experts. We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations. Given a language description of a video, multiple VLM agents direct various processes of the generation pipeline. They cooperate to create Blender scripts which render a video following the given description. Augmented with Blender-based movie making knowledge, the…
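
The create-and-refine loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `refine_script`, the `Review` type, the round limit, and the two stub agents are all hypothetical names introduced here, and the stubs stand in for real VLM calls. The "script" is only a string that would be handed to Blender; nothing is rendered.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    """Hypothetical reviewer verdict: approval flag plus free-text feedback."""
    approved: bool
    feedback: str


def refine_script(
    description: str,
    write_script: Callable[[str, str], str],
    review_script: Callable[[str, str], Review],
    max_rounds: int = 3,  # assumed limit, not taken from the paper
) -> Tuple[str, List[str]]:
    """Alternate between a script-writing agent and a reviewing agent."""
    feedback = ""
    script = ""
    history: List[str] = []
    for _ in range(max_rounds):
        # The writer sees the description and the latest feedback.
        script = write_script(description, feedback)
        history.append(script)
        review = review_script(description, script)
        if review.approved:
            break
        feedback = review.feedback
    return script, history


# Stub agents standing in for VLM calls (assumptions for illustration only).
def stub_writer(description: str, feedback: str) -> str:
    lines = ["import bpy", f"# scene: {description}"]
    if "camera" in feedback:
        lines.append("# TODO: add camera motion")
    return "\n".join(lines)


def stub_reviewer(description: str, script: str) -> Review:
    if "camera" in script:
        return Review(approved=True, feedback="")
    return Review(approved=False, feedback="add camera motion")


if __name__ == "__main__":
    final, drafts = refine_script(
        "a ball rolls down a ramp", stub_writer, stub_reviewer
    )
    print(f"{len(drafts)} draft(s) written")
    print(final)
```

In a real system the writer and reviewer would be model calls, and the reviewer would presumably inspect rendered frames rather than the script text; both details are outside what this listing states.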

Taxonomy

Topics: Artificial Intelligence in Games · Digital Games and Media

Methods: RoIAlign · RoIPool · Softmax