TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

Hritik Bansal; Yonatan Bitton; Michal Yarom; Idan Szpektor; Aditya Grover; Kai-Wei Chang

arXiv:2405.04682·cs.CV·November 11, 2024

TL;DR

This paper introduces TALC, a framework that enables pretrained text-to-video models to generate multi-scene videos aligned with detailed scene descriptions, improving visual consistency and adherence to multi-scene text prompts.

Contribution

The paper proposes a novel Time-Aligned Captions (TALC) framework that enhances T2V models to generate multi-scene videos with better scene alignment and consistency, and demonstrates its effectiveness through finetuning.

Findings

1. TALC improves multi-scene video generation quality.

2. Finetuned models outperform baselines by 29% in human evaluation.

3. Enhanced text-conditioning achieves better scene alignment.

Abstract

Most text-to-video (T2V) generative models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, generating multi-scene videos matters because they are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce a simple and effective Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video with the representations of the first scene description (e.g., 'a red panda climbing a tree') and second scene…
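
The time-aligned conditioning idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`frame_to_scene`, `cross_attend`, `time_aligned_conditioning`), the even split of frames across scenes, and the use of plain single-head attention over small Python lists are all assumptions made for illustration. The point is only that each frame's visual features attend to the text tokens of its own scene description rather than to one caption shared across the whole video.

```python
import math


def frame_to_scene(num_frames, num_scenes):
    """Map each frame index to a scene index.

    Assumption: frames are split across scenes as evenly as possible.
    """
    return [f * num_scenes // num_frames for f in range(num_frames)]


def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def time_aligned_conditioning(frame_features, scene_text_embeddings):
    """Condition each frame only on the text tokens of its own scene.

    frame_features: one feature vector per frame (hypothetical toy input).
    scene_text_embeddings: one list of token vectors per scene description.
    """
    assignment = frame_to_scene(len(frame_features), len(scene_text_embeddings))
    conditioned = []
    for features, scene in zip(frame_features, assignment):
        tokens = scene_text_embeddings[scene]
        conditioned.append(cross_attend(features, tokens, tokens))
    return conditioned


# Toy usage: 4 frames, 2 scenes, one 2-d "token" per scene description.
frames = [[0.5, 0.5]] * 4
scenes = [[[1.0, 0.0]], [[0.0, 1.0]]]
conditioned = time_aligned_conditioning(frames, scenes)
```

In this toy setup the first two frames are conditioned on the first scene description and the last two on the second. In an actual T2V model the frame features would be latent visual features inside the generator and the token vectors would come from a text encoder; the sketch only shows the frame-to-caption alignment.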

Code & Models

Repositories

hritikbansal/talc · jax · Official

Taxonomy

Topics

Video Analysis and Summarization · Human Motion and Animation · Multimodal Machine Learning Applications