AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
Xiuyuan Chen, Yuan Lin, Yuchen Zhang, Weiran Huang

TL;DR
AutoEval-Video introduces a comprehensive benchmark for evaluating large vision-language models in open-ended video question answering, utilizing instance-specific rules and adversarial annotation to ensure robust, human-comparable assessment.
Contribution
The paper presents a novel benchmark with instance-specific evaluation rules and an adversarial annotation mechanism, enabling accurate, automated assessment of vision-language models in complex video QA tasks.
Findings
GPT-4V(ision) achieves 32.2% accuracy on AutoEval-Video.
Human accuracy on the benchmark is 72.8%.
GPT-4-based evaluation achieves a stable evaluation accuracy of about 97.0%, comparable to human evaluators.
Abstract
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video-questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation. 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. By using instance-specific rules as prompt, GPT-4, as an automatic evaluator, can achieve…
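The rule-based evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Instance` schema, the prompt wording, and the `stub_judge` function are all hypothetical, and the stub stands in for a real LLM call (the paper uses GPT-4 as the evaluator). The sketch only shows the general shape: each video-question pair carries its own rules, the rules go into the evaluator prompt, and accuracy is the fraction of responses judged correct.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Instance:
    """One video-question pair with its own evaluation rules (hypothetical schema)."""
    question: str
    rules: str  # instance-specific rules written by annotators


def build_eval_prompt(instance: Instance, response: str) -> str:
    """Assemble an evaluator prompt from the instance's rules and a model response."""
    return (
        "You are grading an answer to a question about a video.\n"
        f"Question: {instance.question}\n"
        f"Evaluation rules: {instance.rules}\n"
        f"Answer to grade: {response}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def parse_verdict(judge_output: str) -> bool:
    """Map the judge's reply to a boolean; anything unclear counts as incorrect."""
    return judge_output.strip().lower().startswith("correct")


def accuracy(
    instances: Sequence[Instance],
    responses: Sequence[str],
    judge: Callable[[str], str],
) -> float:
    """Fraction of responses the judge marks correct under each instance's rules."""
    verdicts = [
        parse_verdict(judge(build_eval_prompt(inst, resp)))
        for inst, resp in zip(instances, responses)
    ]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM evaluator (assumption): checks the graded answer for 'red'."""
    answer = prompt.split("Answer to grade: ")[1].split("\n")[0]
    return "correct" if "red" in answer.lower() else "incorrect"


car = Instance(
    question="What color is the car?",
    rules="Correct only if the answer states that the car is red.",
)
score = accuracy([car, car], ["The car is red.", "It is blue."], stub_judge)
```

In practice `judge` would wrap a call to an actual language model. Keeping it as a plain callable means the prompt construction and scoring logic can be checked without network access.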
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Absolute Position Encodings · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing
