Causality Matters: How Temporal Information Emerges in Video Language Models
Yumeng Shi, Quanyu Long, Yin Wu, Wenya Wang

TL;DR
This paper investigates how temporal understanding emerges in video language models (VideoLMs). It finds that causal attention, rather than positional encodings, is the key mechanism behind temporal reasoning, and it uses this insight to propose two efficiency strategies.
Contribution
The paper uncovers a causal information pathway for temporal understanding in VideoLMs and proposes two efficiency strategies based on that pathway.
Findings
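Reversing the frame sequence while preserving the original positional encodings causes a substantial drop in temporal understanding performance.
Removing or modifying positional encodings in video inputs causes only minimal degradation in temporal understanding performance.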
Causal attention implicitly encodes temporal structure.
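The contrast between these findings can be illustrated with a toy sketch. It is not the authors' model or experimental setup: it assumes a single attention head with Q = K = V equal to the input, random vectors standing in for frame features, and no positional encoding at all. The helper names (`attention`, `order_sensitivity`) are hypothetical. Under these assumptions, bidirectional attention gives each frame the same representation regardless of frame order, whereas a causal mask makes each frame's representation depend on what came before it.

```python
import math
import random


def attention(x, causal):
    """Single-head self-attention with Q = K = V = x and no positional encoding."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Under a causal mask, token i only sees tokens 0..i.
        visible = range(i + 1) if causal else range(n)
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in visible
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w / total * x[j][k] for w, j in zip(weights, visible)) for k in range(d)]
        )
    return out


def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


def order_sensitivity(x, causal):
    """How much per-frame outputs change when the frame order is reversed.

    Outputs for the reversed input are un-reversed before comparison, so a
    value near 0 means each frame gets the same representation in either order.
    """
    forward = attention(x, causal)
    backward = attention(x[::-1], causal)[::-1]
    return max_abs_diff(forward, backward)


random.seed(0)
# Six toy "frame" vectors of dimension 4 (an illustrative assumption).
frames = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]

bidirectional_gap = order_sensitivity(frames, causal=False)
causal_gap = order_sensitivity(frames, causal=True)
print(f"bidirectional gap: {bidirectional_gap:.2e}")
print(f"causal gap:        {causal_gap:.2e}")
```

The bidirectional gap is zero up to floating-point rounding, while the causal gap is clearly nonzero. This shows only that a causal mask alone makes attention order-sensitive; it says nothing about how large the effect is in a trained VideoLM.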
Abstract
Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, remains a core challenge. Prior works emphasize positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in temporal understanding performance. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct extensive analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
