Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models

Rafael Souza; Jia-Hao Lim; Alexander Davis

arXiv:2412.11391 · cs.CV · December 17, 2024

Open Access

TL;DR

This paper introduces TSADP (Temporal Semantic Alignment via Dynamic Prompting), a framework that improves temporal reasoning in large vision-language models by combining dynamic task-specific prompts with temporal contrastive learning, which aligns visual and textual embeddings across time in video sequences.

Contribution

The paper proposes TSADP, a new approach combining dynamic prompting and contrastive learning to enhance temporal reasoning in large vision-language models.

Findings

01. TSADP outperforms state-of-the-art models on the VidSitu dataset.

02. It improves performance on tasks such as entity association and temporal understanding.

03. Human evaluations indicate improved semantic coherence.

Abstract

Temporal reasoning is a critical challenge in video-language understanding, as it requires models to align semantic concepts consistently across time. While existing large vision-language models (LVLMs) and large language models (LLMs) excel at static tasks, they struggle to capture dynamic interactions and temporal dependencies in video sequences. In this work, we propose Temporal Semantic Alignment via Dynamic Prompting (TSADP), a novel framework that enhances temporal reasoning capabilities through dynamic task-specific prompts and temporal contrastive learning. TSADP leverages a Dynamic Prompt Generator (DPG) to encode fine-grained temporal relationships and a Temporal Contrastive Loss (TCL) to align visual and textual embeddings across time. We evaluate our method on the VidSitu dataset, augmented with enriched temporal annotations, and demonstrate significant improvements over…
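
The abstract names a Temporal Contrastive Loss (TCL) but this excerpt does not give its formula. As a rough illustration only, the sketch below uses a generic symmetric InfoNCE objective over time steps: the visual and textual embeddings at the same time step form the positive pair, and all other time steps serve as negatives. The function names, the cosine similarity, and the temperature value are assumptions made for this example, not the authors' definitions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_rows(sim, temperature):
    """Mean cross-entropy where the correct column for row i is i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(sim)


def temporal_contrastive_loss(visual, textual, temperature=0.1):
    """Symmetric InfoNCE over time steps (hypothetical stand-in for TCL).

    visual[t] and textual[t] are embeddings for the same time step t
    (the positive pair); every other time step acts as a negative.
    """
    if len(visual) != len(textual):
        raise ValueError("visual and textual must cover the same time steps")
    sim = [[cosine(v, t) for t in textual] for v in visual]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    return 0.5 * (info_nce_rows(sim, temperature)
                  + info_nce_rows(sim_t, temperature))


# Toy data: three time steps with one-hot embeddings.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted_text = aligned_text[1:] + aligned_text[:1]  # captions off by one step

loss_aligned = temporal_contrastive_loss(visual, aligned_text)
loss_shifted = temporal_contrastive_loss(visual, shifted_text)
print(loss_aligned < loss_shifted)
```

In this toy setup the loss is lower when each caption embedding matches the frame embedding at its own time step than when the captions are shifted by one step, which is the behavior a temporal alignment objective of this kind is meant to reward.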

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: ALIGN