TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang

TL;DR
This paper introduces TALC, a framework that enables pretrained text-to-video models to generate multi-scene videos aligned with detailed scene descriptions, improving visual consistency and adherence to multi-scene text prompts.
Contribution
The paper proposes a novel Time-Aligned Captions (TALC) framework that enhances T2V models to generate multi-scene videos with better scene alignment and consistency, and demonstrates its effectiveness through finetuning.
Findings
TALC improves multi-scene video generation quality.
Finetuned models outperform baselines by 29% in human evaluation.
Enhanced text-conditioning achieves better scene alignment.
Abstract
Most text-to-video (T2V) generative models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce a simple and effective Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video with the representations of the first scene description (e.g., 'a red panda climbing a tree') and second scene…
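The time-alignment idea in the abstract can be illustrated with a minimal sketch: each frame of the generated video is paired with the text representation of the scene it belongs to, so earlier frames see the first description and later frames see the second. This is not the authors' implementation. The function names `frame_to_scene` and `time_aligned_captions` are hypothetical, the even split of frames across scenes is an assumption, and the two-dimensional "embeddings" are toy stand-ins for real text-encoder outputs. The conditioning inside the T2V model itself is not shown.

```python
from typing import List, Sequence


def frame_to_scene(num_frames: int, num_scenes: int) -> List[int]:
    """Map each frame index to a scene index.

    Assumption: frames are split as evenly as possible across scenes,
    in order. The paper may allocate frames differently.
    """
    if num_scenes <= 0 or num_frames < num_scenes:
        raise ValueError("need at least one frame per scene")
    return [f * num_scenes // num_frames for f in range(num_frames)]


def time_aligned_captions(
    scene_embeddings: Sequence[Sequence[float]], num_frames: int
) -> List[Sequence[float]]:
    """Return one text embedding per frame, chosen by the frame's scene."""
    mapping = frame_to_scene(num_frames, len(scene_embeddings))
    return [scene_embeddings[s] for s in mapping]


# Toy embeddings standing in for the two scene descriptions:
# 'a red panda climbing a tree' and 'the red panda sleeps on the top of the tree'.
scene_embeddings = [[1.0, 0.0], [0.0, 1.0]]
per_frame = time_aligned_captions(scene_embeddings, num_frames=8)
```

In this sketch, `per_frame` holds the first scene's embedding for the first four frames and the second scene's embedding for the last four; a real model would feed these per-frame representations into its text-conditioning layers.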
Taxonomy
Topics: Video Analysis and Summarization · Human Motion and Animation · Multimodal Machine Learning Applications
