VLM-Eval: A General Evaluation on Video Large Language Models
Shuailin Li, Yuang Zhang, Yucheng Zhao, Qiuyue Wang, Fan Jia, Yingfei Liu, Tiancai Wang

TL;DR
This paper introduces VLM-Eval, a unified evaluation framework for video Large Language Models that covers multiple video tasks, shows that GPT-based evaluation is effective for judging response quality, and proposes a simple baseline model.
Contribution
The paper presents a unified evaluation method for video LLMs across captioning, question answering, retrieval, and action recognition, and introduces Video-LLaVA, a simple yet effective baseline that outperforms existing video LLMs.
Findings
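GPT-based evaluation can match human-like performance in assessing response quality across multiple aspects.
Video-LLaVA, which uses a single linear projection, outperforms existing video LLMs.
Video LLMs show encouraging recognition and reasoning capabilities in driving scenarios after fine-tuning on only hundreds of video-instruction pairs.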
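
To make the "single linear projection" idea concrete, here is a minimal sketch in plain Python. It assumes, as in LLaVA-style models, that the projection maps per-frame visual features into the language model's embedding space; the summary itself does not spell this out. The function names, the toy dimensions, and the initialisation are all hypothetical and are not taken from the paper's code.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create weights and a bias for one linear layer.

    The uniform initialisation is an illustrative assumption,
    not the scheme used by the paper.
    """
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(tokens, weight, bias):
    """Apply y = W x + b to every visual token independently."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in tokens
    ]

# Toy sizes (hypothetical): 3 visual tokens of width 4, mapped to width 6.
VISION_DIM, LLM_DIM = 4, 6
frame_tokens = [[0.1 * (i + j) for j in range(VISION_DIM)] for i in range(3)]

weight, bias = make_projection(VISION_DIM, LLM_DIM)
llm_inputs = project(frame_tokens, weight, bias)

# One projected vector per visual token, each of width LLM_DIM.
print(len(llm_inputs), len(llm_inputs[0]))
```

The point of the sketch is only that the bridge between the visual features and the language model can be a single matrix multiply plus a bias, with no additional learned modules in between.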
Abstract
Despite the rapid development of video Large Language Models (LLMs), a comprehensive evaluation is still absent. In this paper, we introduce a unified evaluation that encompasses multiple video tasks, including captioning, question and answering, retrieval, and action recognition. In addition to conventional metrics, we showcase how GPT-based evaluation can match human-like performance in assessing response quality across multiple aspects. We propose a simple baseline: Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs. Finally, we evaluate video LLMs beyond academic datasets, which show encouraging recognition and reasoning capabilities in driving scenarios with only hundreds of video-instruction pairs for fine-tuning. We hope our work can serve as a unified evaluation for video LLMs, and help expand more practical scenarios. The evaluation code will…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling
