VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment

Darshana Saravanan; Varun Gupta; Darshan Singh; Zeeshan Khan; Vineet Gandhi; Makarand Tapaswi

arXiv:2406.10889·cs.CV·April 1, 2025·1 cites


Open Access · 1 dataset

TL;DR

VELOCITI introduces a benchmark for evaluating video-language compositional reasoning, focusing on understanding agents and actions across short videos using strict entailment, revealing significant gaps between current models and human performance.

Contribution

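The paper presents VELOCITI, a new benchmark for video-language compositional reasoning, together with StrictVLE, an entailment setup that requires correct classification of both the positive and the negative caption rather than ranking them. It highlights the limitations of current models and the importance of visual context in compositional understanding.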

Findings

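01. Even the best evaluated models score below 50% on VELOCITI (LLaVA-OneVision at 44.5%, Gemini-1.5-Pro at 49.3%), against human accuracy of 93.0%.

02. Action understanding is weaker than agent recognition.

03. Negative captions that use entities appearing in the video are more challenging.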

Abstract

A fundamental aspect of compositional reasoning in a video is associating people and their actions across time. Recent years have seen great progress in general-purpose vision or video models and a move towards long-video understanding. While exciting, we take a step back and ask: are current models good at compositional reasoning on short videos? To this end, we introduce VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents, actions, and their associations across multiple events. We adopt the Video-Language Entailment setup and propose StrictVLE that requires correct classification (rather than ranking) of the positive and negative caption. We evaluate several models and observe that even the best, LLaVA-OneVision (44.5%) and Gemini-1.5-Pro (49.3%), are far from human accuracy at 93.0%. Results show that action understanding lags behind…
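The difference between ranking and strict classification described in the abstract can be illustrated with a minimal sketch. It assumes each video has one entailment score in [0, 1] for its positive caption and one for its negative caption, and that a caption counts as entailed when its score exceeds 0.5. The function names, the 0.5 threshold, and the toy scores are illustrative assumptions, not the paper's implementation or data.

```python
from typing import Sequence


def ranking_accuracy(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of pairs where the positive caption merely outscores the negative one."""
    pairs = list(zip(pos_scores, neg_scores))
    if not pairs:
        raise ValueError("need at least one (positive, negative) score pair")
    return sum(p > n for p, n in pairs) / len(pairs)


def strict_accuracy(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    threshold: float = 0.5,  # assumed decision boundary, not taken from the paper
) -> float:
    """Fraction of pairs where BOTH captions are classified correctly:
    the positive caption is accepted and the negative caption is rejected."""
    pairs = list(zip(pos_scores, neg_scores))
    if not pairs:
        raise ValueError("need at least one (positive, negative) score pair")
    return sum(p > threshold and n < threshold for p, n in pairs) / len(pairs)


# Toy scores (made up for illustration): every positive outranks its negative,
# but only two of the four pairs are classified correctly on both sides.
pos = [0.9, 0.8, 0.6, 0.3]
neg = [0.2, 0.7, 0.4, 0.1]

print(ranking_accuracy(pos, neg))  # 1.0
print(strict_accuracy(pos, neg))   # 0.5
```

In this toy case a model that tends to say "yes" to everything (second pair) or "no" to everything (fourth pair) still gets full credit under ranking, while the strict criterion penalizes both behaviors.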


Code & Models

Datasets

katha-ai-iiith/VELOCITI
dataset · 22 downloads


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Focus