InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Tiehan Fan; Kepan Nan; Rui Xie; Penghao Zhou; Zhenheng Yang; Chaoyou Fu; Xiang Li; Jian Yang; Ying Tai

arXiv:2412.09283·cs.CV·December 13, 2024

Open Access · 1 dataset

TL;DR

InstanceCap introduces an instance-aware structured caption framework that enhances text-to-video generation by providing fine-grained, detailed descriptions, significantly improving fidelity and reducing hallucinations in generated videos.

Contribution

The paper presents what the authors describe as the first instance-level structured captioning method for text-to-video generation, including a cluster of auxiliary models that converts videos into instances and a new dataset, InstanceVid.

Findings

01 Outperforms previous models in fidelity and accuracy

02 Reduces hallucinations in generated videos

03 Provides detailed, instance-aware video captions

Abstract

Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, which affect the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level, fine-grained video captioning for the first time. Based on this scheme, we design a cluster of auxiliary models that converts the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an…
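To make the idea of an instance-aware structured caption more concrete, here is a minimal Python sketch of what such a record might look like. It is an illustration only: the class names, the fields (`name`, `appearance`, `motion`, `background`), and the `to_prompt` format are all hypothetical and are not taken from the paper or from the InstanceVid dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstanceCaption:
    # Hypothetical per-instance fields; the paper's actual schema may differ.
    name: str        # short label for the instance, e.g. "dog"
    appearance: str  # concise phrase describing how the instance looks
    motion: str      # concise phrase describing what the instance does


@dataclass
class StructuredCaption:
    # Hypothetical video-level container: one background phrase plus instances.
    background: str
    instances: List[InstanceCaption] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Flatten the structured caption into a single text prompt."""
        parts = [f"Background: {self.background}."]
        for i, inst in enumerate(self.instances, start=1):
            parts.append(
                f"Instance {i} ({inst.name}): {inst.appearance}; {inst.motion}."
            )
        return " ".join(parts)


caption = StructuredCaption(
    background="a sunny park",
    instances=[
        InstanceCaption("dog", "small brown terrier", "runs left to right"),
    ],
)
prompt = caption.to_prompt()
print(prompt)
```

Keeping each instance as its own record, rather than one free-form paragraph, is one plausible way to hold appearance and motion phrases apart so that they stay short and attributable to a specific instance.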

Code & Models

Datasets

AnonMegumi/InstanceVid
dataset · 48 downloads

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Motion and Animation