VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao

TL;DR
This paper introduces VF-Eval, a comprehensive benchmark for assessing multimodal large language models' ability to evaluate AI-generated videos, revealing current models' limitations and potential improvements through human feedback alignment.
Contribution
The paper presents VF-Eval, a novel benchmark with four tasks for evaluating MLLMs on AIGC videos, and demonstrates how aligning models with human feedback can enhance video generation.
Findings
Even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all four tasks, indicating how difficult it is for MLLMs to evaluate AIGC videos.
VF-Eval exposes limitations of current MLLMs in understanding synthetic videos.
Aligning MLLMs with human feedback improves video generation quality.
Abstract
Multimodal large language models (MLLMs) have recently been widely studied for video question answering. However, most existing assessments focus on natural videos, overlooking synthetic videos such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs in interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
Methods: Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding · Dropout
