Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek

TL;DR
This paper investigates the temporal understanding of video-language models, identifies their limitations in grasping simple time relations, and proposes a lightweight adaptation method to improve their temporal awareness without retraining from scratch.
Contribution
The paper shows that existing video-language models struggle with basic temporal relations, and it introduces a post-pretraining adaptation approach that improves their sense of time without retraining from scratch.
Findings
Existing models struggle with simple temporal relations.
Post-pretraining improves temporal understanding in downstream tasks.
Zero-shot evaluation of the adapted models shows encouraging performance gains on tasks that require high temporal awareness.
Abstract
Modelling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that seven existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which…
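To make "consistency of time order as elicited by before/after relations" concrete, the sketch below builds a caption that agrees with the order of two events in a video and a time-reversed caption that contradicts it. It illustrates the idea only and is not the authors' implementation: the `Event` class, the helper names, and the example captions and timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A captioned video segment (hypothetical structure, for illustration only)."""
    caption: str
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds


def stitch(first: Event, second: Event, relation: str) -> str:
    """Join two captions into one sentence using 'before' or 'after'."""
    if relation not in ("before", "after"):
        raise ValueError(f"unsupported relation: {relation!r}")
    return f"{first.caption} {relation} {second.caption}"


def is_time_order_consistent(a: Event, b: Event, relation: str) -> bool:
    """Return True if the sentence 'a <relation> b' matches the video's event order."""
    if relation == "before":
        return a.end <= b.start
    if relation == "after":
        return b.end <= a.start
    raise ValueError(f"unsupported relation: {relation!r}")


def make_order_pair(earlier: Event, later: Event) -> tuple[str, str]:
    """Return (time-consistent caption, time-reversed caption) for two ordered events."""
    consistent = stitch(earlier, later, "before")
    reversed_order = stitch(later, earlier, "before")
    return consistent, reversed_order


# Made-up example: pouring happens first, drinking second.
pour = Event("a person pours water", 0.0, 3.0)
drink = Event("the person drinks it", 3.5, 6.0)
consistent, reversed_order = make_order_pair(pour, drink)
```

A model with a sense of time should score the video higher against `consistent` than against `reversed_order`; how such pairs are actually used during post-pretraining is described in the paper itself, not here.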
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
