Revisiting the "Video" in Video-Language Understanding

Shyamal Buch; Cristóbal Eyzaguirre; Adrien Gaidon; Jiajun Wu; Li Fei-Fei; Juan Carlos Niebles

arXiv:2206.01720 · cs.CV · June 6, 2022

TL;DR

This paper introduces the atemporal probe (ATP), a model for video-language analysis that bounds the accuracy achievable with image-level understanding alone. Applied to video question answering and text-to-video retrieval, it shows that strong performance on current benchmarks often does not require understanding event temporality.

Contribution

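The paper proposes ATP as a probe that gives a stronger bound on the baseline accuracy of multimodal models constrained to image-level understanding, and uses it to characterize current video-language benchmarks and to inform dataset and model design for temporal understanding.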

Findings

01 Understanding event temporality is often not required for strong or state-of-the-art performance on current video-language benchmarks.

02 ATP can improve dataset analysis by identifying temporally challenging data.

03 Integrating ATP into full video-level temporal models can improve efficiency and accuracy.
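
The second finding can be illustrated with a small sketch. Assuming per-example predictions from an atemporal probe are already available, a dataset can be partitioned into examples the probe answers correctly and the remainder, which are candidates for closer temporal analysis. The record fields (`id`, `label`, `probe_pred`) and the function name are hypothetical, and this is not the authors' procedure. Failing the probe does not by itself prove that an example needs temporal reasoning.

```python
from typing import Dict, List, Tuple


def split_by_probe(examples: List[Dict]) -> Tuple[List[str], List[str]]:
    """Partition example ids by whether an atemporal probe answered correctly.

    Each record is assumed (hypothetically) to carry:
      id          -- example identifier
      label       -- index of the ground-truth answer
      probe_pred  -- index predicted by an image-level (atemporal) probe
    """
    solvable, candidates_temporal = [], []
    for ex in examples:
        if ex["probe_pred"] == ex["label"]:
            solvable.append(ex["id"])             # answerable without temporal order
        else:
            candidates_temporal.append(ex["id"])  # worth inspecting for temporality
    return solvable, candidates_temporal


# Toy records, invented for illustration only.
toy = [
    {"id": "q1", "label": 2, "probe_pred": 2},
    {"id": "q2", "label": 0, "probe_pred": 3},
    {"id": "q3", "label": 1, "probe_pred": 1},
]
solvable_ids, temporal_ids = split_by_probe(toy)
```

The second list is only a starting point for analysis: it mixes examples that may need temporal reasoning with examples the probe simply gets wrong for other reasons.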

Abstract

What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper…
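
To make "constrained by image-level understanding" concrete, here is a minimal sketch of an atemporal baseline for multiple-choice video question answering. It assumes frame and candidate-answer embeddings have already been produced by some frozen image-language encoder, treats the frames as an unordered set, and answers from the single best-matching frame. The function names and the max-similarity rule are assumptions made for illustration; this is not the authors' ATP implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def atemporal_predict(
    frame_embs: List[Sequence[float]],
    candidate_embs: List[Sequence[float]],
) -> Tuple[int, float]:
    """Return (answer index, score) using one frame at a time.

    Frames are scored independently, so frame order carries no
    information: this baseline cannot use event temporality.
    """
    if not frame_embs or not candidate_embs:
        raise ValueError("need at least one frame and one candidate")
    best_idx, best_score = -1, -math.inf
    for frame in frame_embs:                      # unordered set of frames
        for idx, cand in enumerate(candidate_embs):
            score = cosine(frame, cand)
            if score > best_score:
                best_idx, best_score = idx, score
    return best_idx, best_score


# Toy embeddings, invented for illustration only.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
candidates = [[0.0, 1.0], [-1.0, 0.0]]
answer_idx, score = atemporal_predict(frames, candidates)
```

Because each frame is scored on its own, reordering `frames` leaves the selected answer unchanged in this toy case, which is the property that makes such a model a probe for how much a benchmark depends on temporal order.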

Code & Models

Repositories

stanfordvl/atp-video-language
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling