VideoGLUE: Video General Understanding Evaluation of Foundation Models
Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong

TL;DR
This paper systematically evaluates the video understanding capabilities of foundation models across multiple tasks and datasets, revealing that task-specific models outperform FMs and that video-native FMs are generally more effective for video tasks.
Contribution
It provides a comprehensive benchmark and analysis of foundation models for video understanding, highlighting the importance of task specialization and pretraining modality.
Findings
Task-specific models outperform foundation models in video tasks.
Video-native FMs excel in motion-rich video classification and localization.
Light adaptation suffices for video-native FMs, while full finetuning benefits image-native FMs.
Abstract
We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks. Furthermore, we jointly profile FMs' efficacy and efficiency when adapting to general video understanding tasks using cost measurements during both training and inference. Our main findings are as follows. First, task-specialized models significantly outperform the seven FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data mainly contains the video modality, are generally better than image-native FMs in classifying motion-rich…
Peer Reviews
Decision · Submitted to ICLR 2024
1) The paper has an excellent presentation with clear discussion and motivation for the work done. The visualizations are very nice and they further facilitate the text of paper and experiments. 2) Evaluating six foundation models on three main video tasks and eight datasets is a valuable contribution to the research community. It is even more valuable because different adaptions are also considered (fine-tuning, low-rank adapter, and others). Experiments are performed thoroughly with good anal
1) Despite the contribution of suggesting a VideoGLUE score, there are no novel models, modifications, or datasets presented in the paper. In my opinion, it would be very beneficial to develop at least one of them as an additional contribution to make the strongest possible paper. 2) The list of studied foundation models is not as comprehensive as potentially can be. I understand that not all models are publicly available but it would be very interesting and important to make even stronger conclus
1. Evaluating Foundation Models (FMs): The paper emphasizes the complexity involved in evaluating FMs, particularly because they are designed as "generalists" that learn meta-knowledge across tasks. This highlights the need for a standardized evaluation procedure, which this paper aims to provide. 2. VideoGLUE Protocol: The proposed evaluation protocol provides a structured approach to evaluate FMs on video understanding, encompassing various tasks, datasets, and model adaptation methods. This
1. Different datasets emphasize varied aspects in video tasks; for instance, SSV2 focuses on motion, while Kinetics is more context-centric. How does VideoGLUE address the differences among these diverse datasets? 2. While the study delves into transformer-based Foundation Models (FMs), is there any comprehensive analysis or comparison involving 3D-CNN or even 2D-CNN based FMs?
1. The motivation is both clear and justified. There is a pressing need in the community to establish a benchmark for assessing the video understanding capabilities of foundation models. 2. The introduced VideoGLUE benchmark evaluates foundation models across various dimensions such as tasks, datasets, and adaptation methods. 3. This paper highlights three interesting findings into the video understanding capabilities of current foundation models. 4. The paper is well written and easy to follow.
1. This paper analyzes six foundational models, varying in size, pre-training data, and objectives. The diversity of these settings compromises the comparability of the experimental results, rendering the conclusions less reliable. 2. This paper introduces a scalar VideoGLUE score to assess FM's efficacy and efficiency by averaging performance scores across four adaptation methods. Yet, the rationale behind the metric's design appears arbitrary. It's unclear why this particular weighted score, d
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
