Causality Matters: How Temporal Information Emerges in Video Language Models

Yumeng Shi; Quanyu Long; Yin Wu; Wenya Wang

arXiv:2508.11576 · cs.CV · November 18, 2025

Open Access

TL;DR

This paper investigates how temporal understanding emerges in video language models (VideoLMs). It finds that causal attention, rather than positional encodings, is the key mechanism behind temporal reasoning, and uses this insight to propose more efficient model strategies.

Contribution

The paper uncovers the causal information pathway for temporal understanding in VideoLMs and proposes two efficiency strategies based on these insights.

Findings

01

Reversing the frame sequence while preserving the original positional encodings causes a substantial drop in temporal understanding performance.

02

Removing or modifying positional encodings in video inputs causes only minimal degradation in temporal understanding performance.

03

Causal attention implicitly encodes temporal structure.
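
As a rough illustration of why a causal mask alone can carry order information, the sketch below compares a single self-attention layer with and without a causal mask, using no positional encoding at all. It is a toy NumPy example, not the paper's model: the identity projections, the sizes, and the names `self_attention` and `frames` are assumptions made for illustration. Without a mask, reversing the input simply reverses the output (the layer is permutation-equivariant); with a causal mask, it does not, because each position sees a different set of predecessors.

```python
import numpy as np

def self_attention(x, causal):
    """Single-head self-attention with identity projections and no positional encoding.

    Toy setup (an assumption for illustration): queries, keys, and values are all x.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Block attention to later positions: entry (i, j) is masked when j > i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 8))   # 6 hypothetical frame embeddings of width 8
reversed_frames = frames[::-1]     # same frames, opposite temporal order

# Does reversing the input merely reverse the output?
bidir_equivariant = np.allclose(
    self_attention(reversed_frames, causal=False),
    self_attention(frames, causal=False)[::-1],
)
causal_equivariant = np.allclose(
    self_attention(reversed_frames, causal=True),
    self_attention(frames, causal=True)[::-1],
)

print("unmasked attention is order-blind:", bidir_equivariant)
print("causal attention is order-blind:", causal_equivariant)
```

In this toy setting the unmasked layer cannot distinguish the two orderings beyond relabeling positions, while the causal layer produces genuinely different representations, which is consistent with the finding above.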

Abstract

Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, remains a core challenge. Prior works emphasize positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in the performance of temporal understanding. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated…
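
The two input perturbations described in the abstract can be sketched as simple transformations of a token sequence. This is a hedged illustration only: the function names, the per-frame token lists, and the choice of a constant position id as a stand-in for "removing" PEs are all assumptions, and the paper's actual procedure may differ.

```python
def reverse_frames_keep_positions(frames):
    """Reverse frame order while leaving the position ids untouched.

    frames: list of per-frame token lists in their original temporal order.
    Returns (tokens, position_ids). Hypothetical helper for illustration.
    """
    total = sum(len(frame) for frame in frames)
    position_ids = list(range(total))  # same ids as the unperturbed input
    tokens = [tok for frame in reversed(frames) for tok in frame]
    return tokens, position_ids

def flatten_with_constant_positions(frames):
    """Keep frame order but give every video token the same position id.

    One possible way to neutralize positional information (an assumption,
    not necessarily the ablation used in the paper).
    """
    tokens = [tok for frame in frames for tok in frame]
    position_ids = [0] * len(tokens)
    return tokens, position_ids

# Three hypothetical frames with two visual tokens each.
frames = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
rev_tokens, rev_positions = reverse_frames_keep_positions(frames)
flat_tokens, flat_positions = flatten_with_constant_positions(frames)
```

In the first perturbation the position ids match the original input and only the content order changes; in the second the content order is intact and only the positional signal is flattened. Comparing model accuracy under each isolates which signal the model relies on.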

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling