Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
Liu He, Yizhi Song, Hejun Huang, Pinxin Liu, Yunlong Tang, Daniel Aliaga, Xin Zhou

TL;DR
This paper presents an automated pipeline for synthetic video generation in which collaborating Vision Large Language Model (VLM) agents create and refine Blender scripts from textual descriptions, yielding higher-quality videos than existing text-to-video models.
Contribution
Introduces a novel multi-agent VLM-based framework for automatic, text-driven synthetic video creation that improves quality and consistency over existing models.
Findings
Generated videos outperform commercial video generation models on quality metrics.
The framework achieves higher scores in user studies for quality and rationality.
Collaborative agent approach enhances physical realism and temporal consistency.
Abstract
Text-to-video generation has been dominated by diffusion-based or autoregressive models. These novel models provide plausible versatility, but are criticized for improper physical motion, shading and illumination, camera motion, and temporal consistency. The film industry relies on manually-edited Computer-Generated Imagery (CGI) using 3D modeling software. Human-directed 3D synthetic videos address these shortcomings, but require tight collaboration between movie makers and 3D rendering experts. We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations. Given a language description of a video, multiple VLM agents direct various processes of the generation pipeline. They cooperate to create Blender scripts which render a video following the given description. Augmented with Blender-based movie making knowledge, the…
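The create-and-refine collaboration described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the names `refine_script`, `toy_writer`, `toy_reviewer`, and `Review` are hypothetical, the two agents are plain Python stubs rather than VLM calls, and the "Blender script" is only a string that is never executed (a real reviewer would more plausibly inspect rendered frames).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    """Verdict from a (hypothetical) reviewer agent."""
    approved: bool
    feedback: str


def refine_script(
    description: str,
    write_script: Callable[[str, str], str],
    review_script: Callable[[str, str], Review],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Alternate between a writer agent and a reviewer agent.

    The writer produces a script from the description plus the latest
    feedback; the reviewer either approves it or returns new feedback.
    Returns the last script and the history of all drafts.
    """
    feedback = ""
    script = ""
    history: List[str] = []
    for _ in range(max_rounds):
        script = write_script(description, feedback)
        history.append(script)
        review = review_script(description, script)
        if review.approved:
            break
        feedback = review.feedback
    return script, history


# Stub agents standing in for VLM calls (assumptions, for illustration only).
def toy_writer(description: str, feedback: str) -> str:
    lines = ["import bpy", f"# scene: {description}"]
    if "camera" in feedback:
        # bpy.ops.object.camera_add is a Blender operator; here it is only text.
        lines.append("bpy.ops.object.camera_add()")
    return "\n".join(lines)


def toy_reviewer(description: str, script: str) -> Review:
    if "camera_add" not in script:
        return Review(approved=False, feedback="add a camera")
    return Review(approved=True, feedback="")


final_script, history = refine_script(
    "a ball rolling down a ramp", toy_writer, toy_reviewer
)
print(f"drafts written: {len(history)}")
print(final_script)
```

In this toy run the first draft lacks a camera, the reviewer asks for one, and the second draft is approved. Capping the loop with `max_rounds` is a design choice of the sketch, so that a reviewer that never approves cannot stall the pipeline.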
Taxonomy
Topics: Artificial Intelligence in Games · Digital Games and Media
Methods: RoIAlign · RoIPool · Softmax
