A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Nidham Tekaya, Manuela Waldner, Matthias Zeppelzauer

TL;DR
This paper explores how vision-language models understand and represent time, introducing a new benchmark dataset and methods to extract explicit timelines from model embeddings for improved temporal reasoning.
Contribution
The paper introduces TIME10k, a benchmark dataset for temporal evaluation of VLMs, and proposes methods to derive explicit timeline representations from their embeddings.
Findings
Temporal information in VLMs is structured on a low-dimensional, non-linear manifold.
Proposed timeline methods outperform prompt-based baselines in temporal reasoning tasks.
The approach is computationally efficient and effective for modeling time in VLMs.
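To make the contrast in the findings concrete, here is a minimal sketch of the two ideas: a prompt-based baseline that picks the year whose text embedding is most similar to an image embedding, and a one-dimensional "timeline" coordinate read off the embedding space. Everything here is an assumption for illustration. The embeddings are synthetic stand-ins rather than outputs of a real VLM, the function names are hypothetical, and the timeline uses a plain linear PCA projection as a simplification; it is not the method proposed in the paper, which describes the temporal manifold as non-linear.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def synthetic_year_embeddings(years, dim=64, seed=0):
    """Hypothetical stand-in for VLM text embeddings of year prompts.

    Years are placed along a quarter-circle arc (a curved, low-dimensional
    structure) inside a random 3D subspace of a dim-dimensional space,
    with a small amount of noise added.
    """
    rng = np.random.default_rng(seed)
    years = np.asarray(years, dtype=float)
    t = (years - years.min()) / (years.max() - years.min())
    angle = t * np.pi / 2
    coords = np.stack([np.cos(angle), np.sin(angle), np.full_like(t, 0.5)], axis=1)
    basis = np.linalg.qr(rng.normal(size=(dim, 3)))[0]  # orthonormal columns, dim x 3
    emb = coords @ basis.T + 0.001 * rng.normal(size=(len(years), dim))
    return normalize(emb)


def predict_year_by_similarity(image_emb, year_embs, years):
    """Prompt-style baseline: return the year with the highest cosine similarity."""
    sims = normalize(year_embs) @ normalize(image_emb)
    return int(years[int(np.argmax(sims))])


def timeline_coordinate(embs):
    """Project embeddings onto their first principal component.

    This is a linear simplification used only for illustration; the sign of
    the resulting axis is arbitrary, so the order may come out reversed.
    """
    centered = embs - embs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


years = np.arange(1900, 2001, 10)
year_embs = synthetic_year_embeddings(years)

# Pretend an image embedding coincides with the embedding of the 1950 prompt.
image_emb = year_embs[5]
print("baseline prediction:", predict_year_by_similarity(image_emb, year_embs, years))

# Sorting by the 1D coordinate should recover chronological order (up to direction).
coord = timeline_coordinate(year_embs)
print("years sorted by timeline coordinate:", years[np.argsort(coord)])
```

In this toy setup the arc is short enough that a single linear axis preserves chronological order. On a more strongly curved manifold a linear projection can fold distant years onto each other, which is one reason a non-linear timeline representation may be preferable.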
Abstract
Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs using a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological…
