InstanceV: Instance-Level Video Generation

Yuheng Chen; Teng Hu; Jiangning Zhang; Zhucun Xue; Ran Yi; Lizhuang Ma

arXiv:2511.23146 · cs.CV · December 1, 2025

Open Access

TL;DR

InstanceV introduces an instance-level controllable video generation framework that enhances fine-grained control and semantic consistency, outperforming existing models in quality and instance accuracy.

Contribution

The paper presents InstanceV, a novel framework with instance-aware mechanisms and a new benchmark for improved fine-grained control in video generation.

Findings

01. Achieves superior instance-level controllability in video synthesis.

02. Outperforms state-of-the-art models in quality and instance metrics.

03. Introduces InstanceBench for comprehensive evaluation.

Abstract

Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we…
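The abstract does not say how Instance-aware Masked Cross-Attention is computed. As a rough illustration of the general idea only, and not the paper's implementation, the sketch below restricts cross-attention so that each visual token attends only to the text tokens of an instance whose box contains it. It assumes axis-aligned boxes on a single-frame latent grid and zero output for tokens outside every box. All function and parameter names (`box_to_mask`, `instance_attention_mask`, `masked_cross_attention`, `token_owner`) are hypothetical.

```python
import math


def box_to_mask(box, height, width):
    """Flatten an (x0, y0, x1, y1) box into booleans over an H x W grid (row-major)."""
    x0, y0, x1, y1 = box
    return [x0 <= x < x1 and y0 <= y < y1
            for y in range(height) for x in range(width)]


def instance_attention_mask(boxes, token_owner, height, width):
    """allowed[i][j] is True when visual token i lies inside the box of the
    instance that owns text token j (token_owner[j] is an index into boxes)."""
    regions = [box_to_mask(b, height, width) for b in boxes]
    return [[regions[owner][i] for owner in token_owner]
            for i in range(height * width)]


def masked_cross_attention(queries, keys, values, allowed):
    """Scaled dot-product cross-attention with a boolean mask.

    queries: N x d, keys: M x d, values: M x d_v, allowed: N x M.
    A query with no allowed key yields a zero vector (an assumption of this sketch).
    """
    d = len(queries[0])
    d_v = len(values[0])
    outputs = []
    for q, row in zip(queries, allowed):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if ok else None
                  for k, ok in zip(keys, row)]
        valid = [s for s in scores if s is not None]
        if not valid:
            outputs.append([0.0] * d_v)
            continue
        peak = max(valid)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) if s is not None else 0.0 for s in scores]
        total = sum(weights)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values)) / total
                        for c in range(d_v)])
    return outputs


# Toy setup: a 2 x 2 grid, two instances covering the left and right columns,
# and one text token per instance.
boxes = [(0, 0, 1, 2), (1, 0, 2, 2)]
token_owner = [0, 1]
allowed = instance_attention_mask(boxes, token_owner, height=2, width=2)
queries = [[1.0, 0.0]] * 4
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
result = masked_cross_attention(queries, keys, values, allowed)
```

In this toy setup every visual token sees exactly one text token, so each output row equals the value vector of the instance covering that position, regardless of the query. A real video model would apply such a mask per frame and combine it with the global text prompt, which this sketch leaves out.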

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning