VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation

Xinlong Chen; Yuanxing Zhang; Chongling Rao; Yushuo Guan; Jiaheng Liu; Fuzheng Zhang; Chengru Song; Qiang Liu; Di Zhang; Tieniu Tan

arXiv:2502.12782 · cs.AI · May 20, 2025

Open Access · 1 Repo

TL;DR

VidCapBench is a video caption evaluation scheme tailored to controllable text-to-video (T2V) generation. It links caption quality to T2V model performance and supports both rapid automatic evaluation and thorough manual validation.

Contribution

The paper introduces a video caption evaluation framework that is agnostic to caption format and whose scores correlate with T2V quality, supporting both the assessment and the training of T2V models.

Findings

01

VidCapBench outperforms existing captioning evaluation methods in stability and coverage.

02

Scores on VidCapBench significantly correlate with T2V model quality metrics.

03

The scheme supports both rapid and thorough video caption evaluation.
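One common way to check a claim like the second finding is a rank correlation between per-model caption scores and per-model T2V quality scores. The sketch below computes Spearman's rank correlation in plain Python. It is an illustration only: the choice of Spearman's coefficient, the function names, and the input values are assumptions, not the paper's procedure or its results.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for illustration only (NOT taken from the paper):
# one caption-benchmark score and one T2V quality score per captioning model.
caption_scores = [0.41, 0.55, 0.48, 0.62]
t2v_quality = [0.30, 0.52, 0.45, 0.60]
rho = spearman(caption_scores, t2v_quality)
```

A value of `rho` near 1 would mean that models ranked higher by the caption benchmark also tend to yield higher-ranked T2V quality; judging significance would additionally require a p-value or a permutation test.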

Abstract

The training of controllable text-to-video (T2V) models relies heavily on the alignment between videos and captions, yet little existing research connects video caption evaluation with T2V generation assessment. This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation, agnostic to any particular caption format. VidCapBench employs a data annotation pipeline, combining expert model labeling and human refinement, to associate each collected video with key information spanning video aesthetics, content, motion, and physical laws. VidCapBench then partitions these key information attributes into automatically assessable and manually assessable subsets, catering to both the rapid evaluation needs of agile development and the accuracy requirements of thorough validation. By evaluating numerous state-of-the-art captioning models, we…
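The split into automatically and manually assessable key information described in the abstract can be pictured with a small data structure. This is a minimal sketch under stated assumptions: the `KeyInfo` class, its field names, the `partition` and `coverage_by_dimension` helpers, and the sample items are all hypothetical and do not reflect the official repository's schema or scoring rules. Only the four dimension names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four dimensions named in the abstract.
DIMENSIONS = ("aesthetics", "content", "motion", "physical laws")


@dataclass(frozen=True)
class KeyInfo:
    """One annotated piece of key information for a video (hypothetical schema)."""
    video_id: str
    dimension: str         # one of DIMENSIONS
    description: str       # what a good caption should mention
    auto_assessable: bool  # True: checked automatically; False: checked by a human


def partition(items):
    """Split key information into automatically and manually assessable subsets."""
    auto = [it for it in items if it.auto_assessable]
    manual = [it for it in items if not it.auto_assessable]
    return auto, manual


def coverage_by_dimension(judgments):
    """Fraction of key-information items a caption covers, per dimension.

    `judgments` maps each KeyInfo to True if the caption was judged to
    cover it (by a model or a human), else False.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for item, covered in judgments.items():
        totals[item.dimension] += 1
        hits[item.dimension] += int(covered)
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Made-up annotations for a single video, for illustration only.
items = [
    KeyInfo("v1", "content", "a red car is present", True),
    KeyInfo("v1", "motion", "the car turns left", True),
    KeyInfo("v1", "aesthetics", "low-angle cinematic shot", False),
]
auto_items, manual_items = partition(items)
scores = coverage_by_dimension({items[0]: True, items[1]: False})
```

In this picture, the automatic subset would drive quick checks during development, while the manual subset would be reserved for the more careful validation pass.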

Code & Models

Repositories

vidcapbench/vidcapbench
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization