ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models
Avinash Madasu, Vasudev Lal

TL;DR
This paper systematically evaluates how well video retrieval models understand the compositional and syntactic components of text queries. It finds that objects and attributes influence retrieval more than actions and syntax, and that models pre-trained on image-text pairs show stronger compositional and syntactic understanding than models pre-trained on video-text pairs.
Contribution
It provides a comprehensive analysis of the impact of objects, attributes, actions, and syntax on video retrieval performance, highlighting the superiority of image-text pre-trained models in this domain.
Findings
Objects & attributes are more critical than actions and syntax.
Models pre-trained on image-text pairs outperform models pre-trained on video-text pairs in compositional and syntactic understanding.
Syntax and actions have a minor effect on retrieval accuracy.
Abstract
Video retrieval (VR) involves retrieving the ground-truth video from a video database given a text caption, or vice versa. The two important components of compositionality, objects & attributes and actions, are joined using correct syntax to form a proper text query. These components (objects & attributes, actions, and syntax) each play an important role in distinguishing among videos and retrieving the correct ground-truth video. However, it is unclear what effect these components have on video retrieval performance. We therefore conduct a systematic study to evaluate the compositional and syntactic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD and DIDEMO. The study is performed on two categories of video retrieval models: (i) those pre-trained on video-text pairs and fine-tuned on downstream video retrieval datasets (e.g.…
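As an illustration of how the effect of one component on retrieval could be measured, the sketch below scores text-to-video retrieval with Recall@K and compares the score before and after a caption perturbation. It is a minimal sketch and not the paper's protocol: the choice of Recall@K, the word-shuffling perturbation used as a syntax probe, and the function names recall_at_k, shuffle_words and recall_drop are all assumptions made for this example. The similarity matrices are assumed to come from whatever retrieval model is being studied.

```python
import random
import numpy as np


def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Text-to-video Recall@K.

    sim[i, j] is the similarity between caption i and video j.
    Assumption: the ground-truth video for caption i is video i.
    """
    ground_truth = np.diag(sim)
    # Rank of the ground-truth video = number of videos scoring strictly higher.
    ranks = (sim > ground_truth[:, None]).sum(axis=1)
    return float((ranks < k).mean())


def shuffle_words(caption: str, seed: int = 0) -> str:
    """Hypothetical syntax probe: keep the words, destroy their order."""
    words = caption.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def recall_drop(sim_original: np.ndarray, sim_perturbed: np.ndarray, k: int) -> float:
    """Recall@K lost when captions are perturbed (positive = worse retrieval)."""
    return recall_at_k(sim_original, k) - recall_at_k(sim_perturbed, k)


# Toy similarity matrix for 3 captions and 3 videos (made-up numbers).
toy_sim = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.5],
])
print(recall_at_k(toy_sim, 1))
print(shuffle_words("a man rides a horse"))
```

In use, the same captions would be encoded twice, once unchanged and once perturbed, and recall_drop would be compared across perturbation types: a small drop for shuffled captions alongside a large drop for captions with objects removed would be consistent with the findings listed above.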
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
