ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models

Avinash Madasu; Vasudev Lal

arXiv:2306.16533·cs.CV·June 12, 2024

TL;DR

This paper systematically evaluates how well video retrieval models understand the compositional and syntactic components of text queries. It finds that objects and attributes influence retrieval more than actions and syntax do, and that models built on image-text pre-training show better compositional and syntactic understanding than models pre-trained on video-text pairs.

Contribution

It analyzes how objects, attributes, actions, and syntax each affect video retrieval performance, and shows that image-text pre-trained models handle these components better than video-text pre-trained models.

Findings

01 Objects & attributes are more critical than actions and syntax.

02 Pre-trained image-text models outperform video-text models in compositional and syntactic understanding.

03 Syntax and actions have a minor effect on retrieval accuracy.

Abstract

Video retrieval (VR) involves retrieving the ground-truth video from a video database given a text caption, or vice versa. The two important components of compositionality, objects & attributes and actions, are joined using correct syntax to form a proper text query. These components (objects & attributes, actions, and syntax) each play an important role in distinguishing among videos and retrieving the correct ground-truth video. However, it is unclear what effect these components have on video retrieval performance. We therefore conduct a systematic study to evaluate the compositional and syntactic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD and DIDEMO. The study is performed on two categories of video retrieval models: (i) those pre-trained on video-text pairs and fine-tuned on downstream video retrieval datasets (e.g.…
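The retrieval task described in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes captions and videos have already been embedded into a shared vector space, ranks videos by cosine similarity, and scores retrieval with Recall@K, a commonly used retrieval metric. The toy embeddings, the "perturbed" caption set, and the function names `cosine` and `recall_at_k` are hypothetical and exist only to show how a change in the query text could be measured as a change in retrieval accuracy.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(text_embs, video_embs, k):
    """Fraction of captions whose ground-truth video ranks in the top k.

    Assumption: caption i is paired with video i.
    """
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, video) for video in video_embs]
        # Video indices sorted from most to least similar to the caption.
        ranked = sorted(range(len(video_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Hypothetical toy embeddings: three videos and their three captions.
video_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
original = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Hypothetical "perturbed" captions: caption 0 now sits closer to video 1,
# standing in for a query whose objects, actions, or syntax were altered.
perturbed = [[0.1, 0.9, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

r1_original = recall_at_k(original, video_embs, k=1)
r1_perturbed = recall_at_k(perturbed, video_embs, k=1)
r2_perturbed = recall_at_k(perturbed, video_embs, k=2)
print(r1_original, r1_perturbed, r2_perturbed)
```

Comparing a metric such as Recall@1 on original and altered captions is one simple way to quantify how much a given query component matters to a model; the exact perturbations and metrics used in the study are not specified in the excerpt above.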

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Contrastive Language-Image Pre-training