VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation

Shi-Xue Zhang; Hongfa Wang; Duojun Huang; Xin Li; Xiaobin Zhu; Xu-Cheng Yin

arXiv:2505.23484 · cs.CV · May 30, 2025

TL;DR

VCapsBench is a comprehensive large-scale benchmark designed to evaluate the fine-grained quality of video captions, emphasizing spatial-temporal details to improve text-to-video generation models.

Contribution

The paper introduces the first large-scale fine-grained benchmark for video caption quality, with detailed annotations and three new evaluation metrics.

Findings

01

The benchmark includes 5,677 videos and 109,796 QA pairs annotated across 21 fine-grained dimensions.

02

Proposes three metrics: Accuracy, Inconsistency Rate, Coverage Rate.

03

An automated evaluation pipeline uses large language models to verify captions.
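
As a rough illustration of how the three metrics could be aggregated, here is a minimal Python sketch. It is not the authors' implementation. The verdict labels ("correct", "inconsistent", "unanswerable"), the function name caption_metrics, and the formulas themselves are assumptions made for this example: the page does not give the exact definitions, so the denominators in particular may differ from the paper's. The sketch assumes each QA pair has already been judged against a caption, for instance by an LLM that sees only the caption text.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical verdict labels; the paper's own label set may differ.
CORRECT = "correct"            # caption supports the ground-truth answer
INCONSISTENT = "inconsistent"  # caption contradicts the ground-truth answer
UNANSWERABLE = "unanswerable"  # caption does not mention the queried detail
VALID_LABELS = {CORRECT, INCONSISTENT, UNANSWERABLE}


def caption_metrics(verdicts: Iterable[str]) -> Dict[str, float]:
    """Aggregate per-question verdicts into three rates.

    Assumed (illustrative) definitions, all over the total number of questions:
      accuracy           = correct / total
      inconsistency_rate = inconsistent / total
      coverage_rate      = (correct + inconsistent) / total
    """
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one verdict is required")
    answered = counts[CORRECT] + counts[INCONSISTENT]
    return {
        "accuracy": counts[CORRECT] / total,
        "inconsistency_rate": counts[INCONSISTENT] / total,
        "coverage_rate": answered / total,
    }


# Four judged QA pairs for one caption.
example = caption_metrics([CORRECT, CORRECT, INCONSISTENT, UNANSWERABLE])
```

Under these assumed definitions, a caption that omits a detail lowers coverage without raising the inconsistency rate, which keeps omissions separate from outright contradictions.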

Abstract

Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics (Accuracy (AR), Inconsistency…

Code & Models

Repositories

gxym/vcapsbench (PyTorch, official)

Videos

VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization