Bridging Your Imagination with Audio-Video Generation via a Unified Director
Jiaxu Zhang, Tianshu Hu, Yuan Zhang, Zenan Li, Linjie Luo, Guosheng Lin, Xin Chen

TL;DR
UniMAGE is a unified director model that combines script drafting and keyframe generation in a single framework, so that non-experts can create coherent, multi-shot videos from prompts. It uses a Mixture-of-Transformers architecture and a "first interleaving, then disentangling" training paradigm.
Contribution
This work introduces UniMAGE, a novel unified transformer-based framework that integrates script drafting and keyframe generation for AI-driven video creation.
Findings
Achieves state-of-the-art performance in open-source video generation
Generates logically coherent scripts and visually consistent keyframes
Enables non-experts to produce complex videos from prompts
Abstract
Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Artificial Intelligence in Games
