Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models
Rafael Souza, Jia-Hao Lim, Alexander Davis

TL;DR
This paper introduces TSADP, a novel framework that significantly improves temporal reasoning in large vision-language models by using dynamic prompts and contrastive learning, leading to better understanding of video sequences.
Contribution
The paper proposes TSADP, a new approach combining dynamic prompting and contrastive learning to enhance temporal reasoning in large vision-language models.
Findings
TSADP outperforms state-of-the-art models on the VidSitu dataset
Improves performance on tasks such as entity association and temporal understanding
Human evaluations show better semantic coherence
Abstract
Temporal reasoning is a critical challenge in video-language understanding, as it requires models to align semantic concepts consistently across time. While existing large vision-language models (LVLMs) and large language models (LLMs) excel at static tasks, they struggle to capture dynamic interactions and temporal dependencies in video sequences. In this work, we propose Temporal Semantic Alignment via Dynamic Prompting (TSADP), a novel framework that enhances temporal reasoning capabilities through dynamic task-specific prompts and temporal contrastive learning. TSADP leverages a Dynamic Prompt Generator (DPG) to encode fine-grained temporal relationships and a Temporal Contrastive Loss (TCL) to align visual and textual embeddings across time. We evaluate our method on the VidSitu dataset, augmented with enriched temporal annotations, and demonstrate significant improvements over…
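The abstract does not give the form of the Temporal Contrastive Loss. One common way to align visual and textual embeddings across time is a symmetric InfoNCE objective, in which the visual and textual embeddings from the same time step form a positive pair and the other time steps in the clip serve as negatives. The sketch below illustrates that general technique in plain Python. The function names, the temperature value, and the toy embeddings are assumptions made for illustration and are not taken from the paper.

```python
import math

def normalize(vec):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax_at(scores, target):
    # Numerically stable log-softmax of `scores`, evaluated at index `target`.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z

def temporal_contrastive_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE over time steps (an assumed form, not the paper's).

    visual, text: lists of T embeddings, one per time step.
    The pair at the same time step is the positive; every other
    time step in the clip acts as a negative.
    """
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    n = len(v)
    # logits[i][j]: scaled cosine similarity of visual step i and text step j.
    logits = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Visual-to-text: each row should pick out its own time step.
    v2t = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Text-to-visual: each column should pick out its own time step.
    t2v = -sum(log_softmax_at([logits[i][j] for i in range(n)], j)
               for j in range(n)) / n
    return 0.5 * (v2t + t2v)

# Toy clip: 4 time steps, 3-dimensional embeddings (made-up values).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
aligned = temporal_contrastive_loss(frames, frames)
misaligned = temporal_contrastive_loss(frames, frames[::-1])
```

In this toy setup the loss is lower when the text embeddings follow the same temporal order as the visual embeddings than when the order is reversed, which is the behaviour a temporal alignment objective is meant to reward.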
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: ALIGN
