Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives

Ji-jun Park; Soo-joon Choi

arXiv:2412.10720 · cs.CV · December 17, 2024

TL;DR

This paper introduces a novel framework that enhances video captioning by integrating causal and temporal reasoning modules into vision-language models, leading to more coherent and contextually accurate video descriptions.

Contribution

The paper proposes a Causal-Temporal Reasoning Module (CTRM), comprising a Causal Dynamics Encoder (CDE) and a Temporal Relational Learner (TRL), that improves video captioning by capturing causal and temporal dependencies.

Findings

01. Outperforms existing models on MSVD and MSR-VTT benchmarks.

02. Achieves higher scores in CIDEr, BLEU-4, and ROUGE-L metrics.

03. Produces more fluent and contextually relevant captions.

Abstract

Video captioning is a critical task in the field of multimodal machine learning, aiming to generate descriptive and coherent textual narratives for video content. While large vision-language models (LVLMs) have shown significant progress, they often struggle to capture the causal and temporal dynamics inherent in complex video sequences. To address this limitation, we propose an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs. CTRM comprises two key components: the Causal Dynamics Encoder (CDE) and the Temporal Relational Learner (TRL), which collectively encode causal dependencies and temporal consistency from video frames. We further design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets, fine-tuning on causally annotated data, and contrastive alignment for better…
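
The abstract names contrastive alignment as one stage of the training strategy, but this excerpt does not give its formulation. As a rough illustration only, the sketch below assumes a standard symmetric InfoNCE-style objective over paired video and caption embeddings. The function names, the temperature value, and the choice of InfoNCE itself are assumptions made here for illustration, not details confirmed by the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax_at(logits, index):
    """Log-probability of `index` under a softmax over `logits` (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[index] - log_sum


def contrastive_alignment_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair.

    Hypothetical stand-in for the paper's contrastive alignment stage.
    """
    assert video_embs and len(video_embs) == len(text_embs)
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # Cosine similarity between every video and every caption, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text: each video should rank its own caption highest.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-video: each caption should rank its own video highest.
    t2v = -sum(log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", contrastive_alignment_loss(videos, matched))
    print("swapped pairs:", contrastive_alignment_loss(videos, swapped))
```

Under this assumed objective, correctly paired embeddings yield a loss near zero, while mismatched pairs are penalized heavily, which is the behavior a video-text alignment stage would rely on.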

Taxonomy

Topics: Digital Storytelling and Education