VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Tingyu Song; Tongyan Hu; Guo Gan; Yilun Zhao

arXiv:2505.23693 · cs.CV · May 30, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces VF-Eval, a benchmark for assessing how well multimodal large language models (MLLMs) interpret and evaluate AI-generated videos. It reveals the limitations of current models and shows that aligning them with human feedback can improve video generation.

Contribution

The paper presents VF-Eval, a novel benchmark with four tasks for evaluating MLLMs on AIGC videos, and demonstrates how aligning models with human feedback can enhance video generation.

Findings

01. Even GPT-4.1, the best-performing of the 13 evaluated models, struggles to perform consistently well across all four tasks, which shows how difficult evaluating AIGC videos remains.

02. VF-Eval exposes limitations of current MLLMs in understanding synthetic videos.

03. Aligning MLLMs with human feedback improves video generation quality.

Abstract

MLLMs have recently been widely studied for video question answering. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs in interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications…

Code & Models

Repositories

sighingsnow/vf-eval
Official

Datasets

songtingyu/vf-eval
dataset · 4 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling

Methods: Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding · Dropout