VLM-Eval: A General Evaluation on Video Large Language Models

Shuailin Li; Yuang Zhang; Yucheng Zhao; Qiuyue Wang; Fan Jia; Yingfei Liu; Tiancai Wang

arXiv:2311.11865 · cs.CV · November 21, 2023 · 1 citation

Open Access

TL;DR

This paper introduces VLM-Eval, a unified evaluation for video Large Language Models (LLMs) that covers captioning, question answering, retrieval, and action recognition. It shows that GPT-based evaluation can match human-like judgments of response quality, and it proposes a simple baseline model, Video-LLaVA.

Contribution

The paper presents a unified evaluation method for video LLMs across multiple tasks and introduces Video-LLaVA, a simple baseline that outperforms existing video LLMs.

Findings

01  GPT-based evaluation can match human-like performance in assessing response quality across multiple aspects.

02  Video-LLaVA, which uses a single linear projection, outperforms existing video LLMs.

03  Video LLMs show encouraging recognition and reasoning capabilities in driving scenarios after fine-tuning on only hundreds of video-instruction pairs.

Abstract

Despite the rapid development of video Large Language Models (LLMs), a comprehensive evaluation is still absent. In this paper, we introduce a unified evaluation that encompasses multiple video tasks, including captioning, question and answering, retrieval, and action recognition. In addition to conventional metrics, we showcase how GPT-based evaluation can match human-like performance in assessing response quality across multiple aspects. We propose a simple baseline: Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs. Finally, we evaluate video LLMs beyond academic datasets, which show encouraging recognition and reasoning capabilities in driving scenarios with only hundreds of video-instruction pairs for fine-tuning. We hope our work can serve as a unified evaluation for video LLMs, and help expand more practical scenarios. The evaluation code will…
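
As a rough illustration of what "a single linear projection" can mean in this setting, the sketch below maps a few visual tokens into a vector space of a different width, as one would before passing them to a language model. It is a minimal sketch, not the Video-LLaVA implementation: the function name `linear_project`, the sizes `D_VISUAL`, `D_LLM`, and `NUM_TOKENS`, and the random weights are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import random


def linear_project(features, weight, bias):
    """Apply y = W x + b to every visual token.

    features: list of tokens, each a list of d_in floats
    weight:   d_out x d_in matrix stored as a list of rows
    bias:     list of d_out floats
    """
    projected = []
    for token in features:
        if len(token) != len(weight[0]):
            raise ValueError("token width does not match weight input size")
        projected.append([
            sum(w * x for w, x in zip(row, token)) + b
            for row, b in zip(weight, bias)
        ])
    return projected


# Hypothetical sizes, far smaller than any real model, for illustration only.
D_VISUAL, D_LLM, NUM_TOKENS = 4, 6, 3

rng = random.Random(0)
weight = [[rng.uniform(-0.1, 0.1) for _ in range(D_VISUAL)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM
visual_tokens = [[rng.random() for _ in range(D_VISUAL)] for _ in range(NUM_TOKENS)]

# Each visual token now has the (assumed) width of a language-model embedding.
llm_inputs = linear_project(visual_tokens, weight, bias)
print(len(llm_inputs), len(llm_inputs[0]))  # prints "3 6"
```

In a trained model the weight matrix and bias would be learned rather than sampled. The point of the sketch is only that the connector between the visual features and the language model is a single affine map.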

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling