VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

Shicheng Li; Lei Li; Shuhuai Ren; Yuanxin Liu; Yi Liu; Rundong Gao; Xu Sun; Lu Hou

arXiv:2311.17404 · cs.CV · September 24, 2024 · 1 citation

Open Access · 1 Repo · 1 Dataset

TL;DR

VITATECS is a diagnostic dataset designed to evaluate and improve the temporal concept understanding of video-language models, addressing limitations of existing benchmarks that rely on static visual cues.

Contribution

The paper introduces a new dataset with a taxonomy of temporal concepts and counterfactual descriptions, enabling precise assessment of temporal understanding in VidLMs.

Findings

01. Current models show deficiencies in temporal understanding.
02. The dataset reveals the need for models to better grasp temporal aspects.
03. Counterfactual descriptions help disentangle static and temporal information.

Abstract

The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop…
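The counterfactual pairing described in the abstract suggests a simple diagnostic: check whether a model scores the original description above its counterfactual for the same video, and report the result per temporal aspect. The sketch below is only an illustration under stated assumptions, not the paper's evaluation code. The function name `pairwise_accuracy`, the record layout, the aspect labels, and the toy scores are all hypothetical, and the similarity scores are assumed to come from some video-text model that is not shown here.

```python
from collections import defaultdict


def pairwise_accuracy(records):
    """Per-aspect fraction of pairs where the original caption outscores
    its counterfactual.

    `records` is an iterable of (aspect, score_original, score_counterfactual)
    tuples. This layout is a hypothetical convenience, not the dataset's format.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for aspect, score_original, score_counterfactual in records:
        totals[aspect] += 1
        # A tie counts as a failure: the model did not prefer the original.
        if score_original > score_counterfactual:
            hits[aspect] += 1
    return {aspect: hits[aspect] / totals[aspect] for aspect in totals}


# Toy similarity scores; the aspect labels are illustrative placeholders.
toy_records = [
    ("direction", 0.81, 0.40),
    ("direction", 0.35, 0.52),
    ("sequence", 0.66, 0.61),
]
result = pairwise_accuracy(toy_records)
print(result)
```

Because each counterfactual differs from the original only in one temporal aspect, a model that relies on static cues alone would be expected to score near chance on this kind of comparison.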

Code & Models

Repositories

lscpku/vitatecs
PyTorch · Official

Datasets

lscpku/VITATECS
dataset · 8 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Human Pose and Action Recognition