InstanceV: Instance-Level Video Generation
Yuheng Chen, Teng Hu, Jiangning Zhang, Zhucun Xue, Ran Yi, Lizhuang Ma

TL;DR
InstanceV introduces an instance-level controllable video generation framework that enhances fine-grained control and semantic consistency, outperforming existing models in quality and instance accuracy.
Contribution
The paper presents InstanceV, a novel framework with instance-aware mechanisms and a new benchmark for improved fine-grained control in video generation.
Findings
Achieves superior instance-level controllability in video synthesis.
Outperforms state-of-the-art models in quality and instance metrics.
Introduces InstanceBench for comprehensive evaluation.
Abstract
Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we…
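The abstract does not spell out how the Instance-aware Masked Cross-Attention is computed. The general idea of restricting cross-attention to instance regions can be illustrated with a minimal NumPy sketch; it is not the authors' implementation. The function names (`box_masks`, `masked_cross_attention`), the normalized bounding-box format, the single-frame latent grid, and the choice to return zeros for positions outside every box are all assumptions made for illustration.

```python
import numpy as np

def box_masks(boxes, height, width):
    """Rasterize normalized (x0, y0, x1, y1) boxes onto a height x width grid.

    Returns a boolean array of shape (num_instances, height * width) that is
    True where the centre of a grid cell falls inside the instance's box.
    """
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    masks = []
    for x0, y0, x1, y1 in boxes:
        inside = (xx >= x0) & (xx < x1) & (yy >= y0) & (yy < y1)
        masks.append(inside.reshape(-1))
    return np.stack(masks)

def masked_cross_attention(queries, keys, values, token_instance, masks):
    """Cross-attention in which each position sees only its own instance's tokens.

    queries:        (P, d)   one query per spatial position
    keys:           (T, d)   instance-caption token keys
    values:         (T, d_v) instance-caption token values
    token_instance: (T,)     index of the instance that owns each token
    masks:          (N, P)   boolean region mask per instance

    Assumption: positions covered by no instance receive a zero vector.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)            # (P, T)
    allowed = masks[token_instance].T                 # (P, T)
    scores = np.where(allowed, scores, -np.inf)
    # Rows with no allowed token would be all -inf; neutralize them first.
    has_any = allowed.any(axis=1, keepdims=True)
    safe = np.where(has_any, scores, 0.0)
    safe = safe - safe.max(axis=1, keepdims=True)     # numerical stability
    weights = np.where(allowed, np.exp(safe), 0.0)
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom,
                        out=np.zeros_like(weights), where=denom > 0)
    return weights @ values                           # (P, d_v)
```

In this sketch, the mask is applied before the softmax, so tokens describing one instance cannot influence positions that belong to another instance's box. A video model would additionally need per-frame boxes and a way to combine this local signal with the global prompt, which the abstract attributes to a separate module.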
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
