InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai

TL;DR
InstanceCap introduces an instance-aware structured caption framework that enhances text-to-video generation by providing fine-grained, detailed descriptions, significantly improving fidelity and reducing hallucinations in generated videos.
Contribution
The paper presents the first instance-level structured captioning method for text-to-video generation, together with a cluster of auxiliary models that converts videos into instances and a new 22K dataset, InstanceVid.
Findings
Outperforms previous models in fidelity and accuracy
Reduces hallucinations in generated videos
Provides detailed, instance-aware video captions
Abstract
Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient details, hallucinations and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video caption for the first time. Based on this scheme, we design an auxiliary models cluster to convert original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an…
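As a rough illustration of what an instance-aware structured caption could look like as data, here is a minimal Python sketch. The class names, field names (`name`, `appearance`, `motion`, `global_description`) and the `to_prompt` helper are assumptions made for this example; they are not taken from the paper or its code, and the actual InstanceCap schema may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstanceCaption:
    # Hypothetical per-instance fields, chosen for illustration only.
    name: str
    appearance: str
    motion: str


@dataclass
class StructuredCaption:
    # Hypothetical container: one global description plus per-instance phrases.
    global_description: str
    instances: List[InstanceCaption] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Flatten the structure into one text prompt, one phrase per instance."""
        parts = [self.global_description.strip()]
        for inst in self.instances:
            parts.append(f"{inst.name}: {inst.appearance}; {inst.motion}")
        return " | ".join(parts)


caption = StructuredCaption(
    global_description="A dog chases a ball in a park",
    instances=[
        InstanceCaption("dog", "brown, short fur", "runs left to right"),
        InstanceCaption("ball", "red, small", "rolls across the grass"),
    ],
)
print(caption.to_prompt())
```

Keeping each instance as its own record, rather than one dense paragraph, is one way to get the concise structured phrases the abstract describes: every attribute stays attached to the instance it belongs to.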
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Motion and Animation
