VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model

Xinhao Li; Zhenpeng Huang; Jing Wang; Kunchang Li; Limin Wang

arXiv:2407.06491 · cs.CV · July 10, 2024

TL;DR

VideoEval introduces a comprehensive benchmark suite for evaluating Video Foundation Models, addressing limitations of existing benchmarks by assessing task adaptability and representation power across diverse tasks and models.

Contribution

The paper presents VideoEval, a new benchmark suite that evaluates VFMs on task adaptability and representation power, revealing insights into their generalization and pre-training paradigms.

Findings

01. VFMs show weak generalization across tasks

02. More data does not always improve performance

03. Combining pre-training paradigms enhances generalization

Abstract

With the growth of high-quality data and advances in visual pre-training paradigms, Video Foundation Models (VFMs) have made significant progress recently, demonstrating remarkable performance on traditional video understanding benchmarks. However, existing benchmarks (e.g., Kinetics) and their evaluation protocols are often limited by relatively poor diversity, high evaluation costs, and saturated performance metrics. In this paper, we build a comprehensive benchmark suite, VideoEval, to address these issues. Specifically, we establish the Video Task Adaption Benchmark (VidTAB) and the Video Embedding Benchmark (VidEB) from two perspectives: evaluating the task adaptability of VFMs under few-shot conditions, and assessing their representation power by applying them directly to downstream tasks. With VideoEval, we conduct a large-scale study on 20 popular open-source vision…
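The two perspectives named in the abstract (few-shot adaptation and direct use of embeddings) can be illustrated with a minimal sketch. This is not the VideoEval protocol or the repository's API: the function names `sample_few_shot` and `nearest_neighbor_accuracy` are hypothetical, the embeddings are assumed to be plain lists of floats already extracted from a frozen model, and cosine-similarity nearest-neighbor matching is an assumed stand-in for whatever scoring the benchmark actually uses.

```python
import math
import random
from collections import defaultdict


def sample_few_shot(labels, k, seed=0):
    """Pick up to k example indices per class (a hypothetical K-shot subset)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        chosen.extend(indices[:k])
    return sorted(chosen)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor_accuracy(train_emb, train_labels, test_emb, test_labels):
    """Label each test embedding with the class of its most similar
    training embedding; no model weights are updated."""
    correct = 0
    for emb, label in zip(test_emb, test_labels):
        best = max(range(len(train_emb)), key=lambda i: cosine(emb, train_emb[i]))
        correct += train_labels[best] == label
    return correct / len(test_labels)
```

In this sketch, `sample_few_shot` would select the small labeled subset used for adaptation, while `nearest_neighbor_accuracy` scores frozen embeddings without any training, which is one inexpensive way to probe representation quality.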

Code & Models

Repositories

leexinhao/VideoEval
PyTorch · Official

Datasets

lixinhao/VideoEval
dataset · 4 dl

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Video Analysis and Summarization