SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
Ryosuke Matsuda, Keito Kudo, Haruto Yoshida, Nobuyuki Shimizu, Jun Suzuki

TL;DR
SLVMEval is a benchmark for meta-evaluating the systems that assess text-to-long-video generation. It builds pairs of source and synthetically degraded videos, filters them through crowdsourcing, and shows that current evaluation systems fall short of human judgment.
Contribution
The paper introduces a synthetic, controlled benchmark for meta-evaluating long-video assessment systems, highlighting their shortcomings relative to human evaluators.
Findings
Human annotators achieve 84.7%-96.8% accuracy when judging video quality on the benchmark's pairs.
Existing evaluation systems underperform humans in 9 out of 10 aspects.
SLVMEval provides a controlled testbed for assessing evaluation system reliability.
Abstract
This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled "high-quality versus low-quality" pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation…
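The pairwise meta-evaluation described in the abstract can be illustrated with a minimal sketch. It assumes an evaluation system exposed as a scoring function that takes a prompt and a video and returns a number, and it counts the system as correct on a pair when it scores the source video strictly higher than the degraded one. The names `VideoPair` and `pairwise_accuracy`, the aspect labels, and the choice to count ties as errors are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class VideoPair(NamedTuple):
    """One benchmark item (hypothetical layout, not the paper's schema)."""
    prompt: str     # text prompt or caption for the video
    original: str   # identifier of the high-quality source video
    degraded: str   # identifier of the synthetically degraded video
    aspect: str     # which aspect the degradation targets


def pairwise_accuracy(
    pairs: Iterable[VideoPair],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return, per aspect, the fraction of pairs where the evaluator
    scores the original strictly above the degraded video.

    Assumption: a tie counts as a failure, since the evaluator did not
    prefer the higher-quality video.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        total[pair.aspect] += 1
        if score(pair.prompt, pair.original) > score(pair.prompt, pair.degraded):
            correct[pair.aspect] += 1
    return {aspect: correct[aspect] / total[aspect] for aspect in total}
```

Reporting accuracy per aspect, rather than as one pooled number, makes it possible to compare an evaluation system with human accuracy on each aspect separately.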
