Towards A Better Metric for Text-to-Video Generation
Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

TL;DR
This paper introduces T2VScore, a new evaluation metric for text-to-video generation that combines text-video alignment and video quality assessment, addressing the limitations of existing metrics and providing a more reliable evaluation method.
Contribution
The paper proposes T2VScore, a novel evaluation pipeline for text-to-video models, and introduces the TVGE dataset with human judgments to benchmark and improve video quality assessment.
Findings
T2VScore outperforms existing metrics in evaluating text-to-video quality.
The TVGE dataset provides a valuable resource with human judgments for future research.
Experiments show T2VScore aligns better with human perception than traditional metrics.
Abstract
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation…
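The abstract's point about weak temporal assessment can be illustrated with a small sketch. It assumes the common frame-averaged form of CLIP Score (mean cosine similarity between a text embedding and each frame embedding), which may differ from the exact variant the paper evaluates. The embeddings are random stand-ins rather than real CLIP outputs, and the function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def frame_averaged_clip_score(text_emb, frame_embs):
    """Mean text-frame cosine similarity (assumed frame-averaged CLIP Score)."""
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)


# Random stand-ins for embeddings; a real pipeline would use a CLIP encoder.
rng = random.Random(0)
text_emb = [rng.gauss(0, 1) for _ in range(8)]
frames = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]

# Scramble the frame order, which destroys any temporal structure.
shuffled = frames[:]
rng.shuffle(shuffled)

original_score = frame_averaged_clip_score(text_emb, frames)
shuffled_score = frame_averaged_clip_score(text_emb, shuffled)

# The two scores agree up to floating-point rounding.
print(math.isclose(original_score, shuffled_score, abs_tol=1e-12))
```

Because the score is an average over frames, it does not depend on frame order, so a video with scrambled frames receives the same value as the original. This is one concrete sense in which such a metric cannot capture temporal quality.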
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimedia Communication and Technology
Methods: Contrastive Language-Image Pre-training
