VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
Xinhao Li, Zhenpeng Huang, Jing Wang, Kunchang Li, Limin Wang

TL;DR
VideoEval introduces a comprehensive benchmark suite for evaluating Video Foundation Models, addressing limitations of existing benchmarks by assessing task adaptability and representation power across diverse tasks and models.
Contribution
The paper presents VideoEval, a new benchmark suite that evaluates VFMs on task adaptability and representation power, revealing insights into their generalization and pre-training paradigms.
Findings
VFMs show weak generalization across tasks
More data does not always improve performance
Combining pre-training paradigms enhances generalization
Abstract
With the growth of high-quality data and advancement in visual pre-training paradigms, Video Foundation Models (VFMs) have made significant progress recently, demonstrating their remarkable performance on traditional video understanding benchmarks. However, the existing benchmarks (e.g. Kinetics) and their evaluation protocols are often limited by relatively poor diversity, high evaluation costs, and saturated performance metrics. In this paper, we build a comprehensive benchmark suite to address these issues, namely VideoEval. Specifically, we establish the Video Task Adaption Benchmark (VidTAB) and the Video Embedding Benchmark (VidEB) from two perspectives: evaluating the task adaptability of VFMs under few-shot conditions and assessing their representation power by directly applying to downstream tasks. With VideoEval, we conduct a large-scale study on 20 popular open-source vision…
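To make "assessing representation power by directly applying to downstream tasks" concrete, here is a minimal sketch of one way frozen embeddings can be scored without any fine-tuning: nearest-neighbor retrieval with a recall@1 metric. This is an illustration only, not the VidEB protocol. The function names (`cosine`, `recall_at_1`), the choice of cosine similarity, and the toy vectors are all assumptions; in practice the vectors would come from a frozen VFM.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_1(query_emb, query_labels, gallery_emb, gallery_labels):
    """Fraction of queries whose nearest gallery item shares their label."""
    hits = 0
    for q, q_label in zip(query_emb, query_labels):
        # Index of the gallery embedding most similar to this query.
        best = max(range(len(gallery_emb)), key=lambda i: cosine(q, gallery_emb[i]))
        hits += gallery_labels[best] == q_label
    return hits / len(query_emb)

# Toy embeddings standing in for frozen VFM outputs (hypothetical numbers).
gallery = [[1.0, 0.0], [0.0, 1.0]]
gallery_labels = ["run", "jump"]
queries = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
query_labels = ["run", "jump", "jump"]

score = recall_at_1(queries, query_labels, gallery, gallery_labels)
print(f"recall@1 = {score:.3f}")
```

Because the model is never updated, the cost of such an evaluation is dominated by a single forward pass per video, which is in the spirit of the low-cost goal stated in the title.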
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Video Analysis and Summarization
