Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives
Ji-jun Park, Soo-joon Choi

TL;DR
This paper introduces a novel framework that enhances video captioning by integrating causal and temporal reasoning modules into vision-language models, leading to more coherent and contextually accurate video descriptions.
Contribution
The paper proposes a Causal-Temporal Reasoning Module (CTRM), composed of a Causal Dynamics Encoder (CDE) and a Temporal Relational Learner (TRL), that improves video captioning by capturing causal and temporal dependencies across video frames.
Findings
Outperforms existing models on the MSVD and MSR-VTT benchmarks.
Achieves higher CIDEr, BLEU-4, and ROUGE-L scores than those models.
Produces more fluent and contextually relevant captions.
Abstract
Video captioning is a critical task in the field of multimodal machine learning, aiming to generate descriptive and coherent textual narratives for video content. While large vision-language models (LVLMs) have shown significant progress, they often struggle to capture the causal and temporal dynamics inherent in complex video sequences. To address this limitation, we propose an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs. CTRM comprises two key components: the Causal Dynamics Encoder (CDE) and the Temporal Relational Learner (TRL), which collectively encode causal dependencies and temporal consistency from video frames. We further design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets, fine-tuning on causally annotated data, and contrastive alignment for better…
Taxonomy
Topics: Digital Storytelling and Education
