TL;DR
This paper presents VC4VG, a framework for optimizing video captions specifically for training text-to-video (T2V) models. It proposes a caption design methodology aimed at improving video generation quality and introduces a new benchmark, VC4VG-Bench, for evaluating captions.
Contribution
Introduction of VC4VG, a comprehensive caption optimization framework tailored for T2V models, along with VC4VG-Bench, a new benchmark for evaluation.
Findings
Improved caption quality correlates with better video generation performance.
The proposed caption design methodology makes T2V training more effective.
Benchmark tools facilitate future research in caption optimization for T2V.
Abstract
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a…