AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering

Xiuyuan Chen; Yuan Lin; Yuchen Zhang; Weiran Huang

arXiv:2311.14906·cs.CV·July 16, 2024·1 cites


TL;DR

AutoEval-Video introduces a comprehensive benchmark for evaluating large vision-language models in open-ended video question answering, utilizing instance-specific rules and adversarial annotation to ensure robust, human-comparable assessment.

Contribution

The paper presents a novel benchmark with instance-specific evaluation rules and an adversarial annotation mechanism, enabling accurate, automated assessment of vision-language models in complex video QA tasks.

Findings

01. GPT-4V(ision) achieves 32.2% accuracy on AutoEval-Video.

02. Human accuracy on the benchmark is 72.8%.

03. GPT-4-based evaluation reaches a stable evaluation accuracy of around 97.0%, comparable to that of human evaluators.

Abstract

We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video-questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation. 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. By using instance-specific rules as prompt, GPT-4, as an automatic evaluator, can achieve…
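The rule-based evaluation described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `Instance` schema, the prompt wording in `build_prompt`, the one-word verdict format, and the `judge` callable (a stand-in for whatever LLM evaluator is used) are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Instance:
    """One video-question pair with its own evaluation rules (hypothetical schema)."""
    question: str
    rules: str


def build_prompt(instance: Instance, response: str) -> str:
    # Hypothetical prompt layout: the instance-specific rules are placed in the
    # prompt instead of a single reference answer.
    return (
        "You are grading an answer to a question about a video.\n"
        f"Question: {instance.question}\n"
        f"Evaluation rules: {instance.rules}\n"
        f"Response: {response}\n"
        "Reply with a single word on the last line: correct or incorrect."
    )


def parse_verdict(judge_output: str) -> bool:
    # Exact match on the last non-empty line, because the string
    # "incorrect" contains "correct" and a substring test would misfire.
    lines = [ln.strip().lower() for ln in judge_output.splitlines() if ln.strip()]
    return bool(lines) and lines[-1].rstrip(".") == "correct"


def accuracy(
    instances: Sequence[Instance],
    responses: Sequence[str],
    judge: Callable[[str], str],
) -> float:
    # `judge` maps a prompt string to the evaluator's raw text output.
    if len(instances) != len(responses):
        raise ValueError("instances and responses must have the same length")
    if not instances:
        return 0.0
    hits = sum(
        parse_verdict(judge(build_prompt(inst, resp)))
        for inst, resp in zip(instances, responses)
    )
    return hits / len(instances)
```

Passing the evaluator in as a plain callable keeps the sketch independent of any particular model API; a real setup would wrap a model call behind `judge` and would likely need more careful parsing of the verdict.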


Code & Models

Repositories

xiuyuan-chen/autoeval-video
none · Official


Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Absolute Position Encodings · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing