TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos
Fawad Javed Fateh, Umer Ahmed, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran

TL;DR
TemporalVLM is a video large language model for temporal reasoning and fine-grained understanding of long videos. It pairs a time-aware visual encoder with a BiLSTM module for global feature aggregation, and is accompanied by a new dataset, IndustryASM, for evaluation.
Contribution
We introduce TemporalVLM, the first video LLM to incorporate LSTMs, along with IndustryASM, a new dataset for temporal reasoning tasks on long videos.
Findings
TemporalVLM outperforms previous methods in multiple video understanding tasks.
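The visual encoder combines local cues (timestamp-aware clip features fused across overlapping temporal windows) with global cues (BiLSTM aggregation over the local features).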
The IndustryASM dataset supports detailed temporal analysis in industrial videos.
Abstract
We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder for mapping a long-term video into features which are time-aware and contain both local and global cues. It first divides an input video into short-term clips, which are jointly encoded with timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM, consisting of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action…
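The two encoder stages described in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: `split_into_clips`, `encode_clip`, `overlapping_windows`, `mean_pool`, and `LSTMCell` are hypothetical names, the clip encoder is a placeholder that derives features from the timestamp alone, mean pooling stands in for whatever fusion the paper uses within each window, and all sizes (clip length, window, stride, dimensions) are arbitrary assumptions.

```python
import math
import random

def split_into_clips(num_frames, fps, clip_len):
    """Split frame indices into short-term clips tagged with a start timestamp in seconds."""
    return [{"frames": (s, min(s + clip_len, num_frames)), "timestamp": s / fps}
            for s in range(0, num_frames, clip_len)]

def encode_clip(clip, dim):
    """Placeholder clip encoder (assumption): features depend only on the timestamp."""
    return [math.sin(clip["timestamp"] + k) for k in range(dim)]

def overlapping_windows(num_clips, window, stride):
    """Index ranges of temporal windows; they overlap whenever stride < window."""
    starts = range(0, max(num_clips - window, 0) + 1, stride)
    return [(s, min(s + window, num_clips)) for s in starts]

def mean_pool(vectors):
    """Stand-in local fusion (assumption): average the clip features in one window."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """A plain LSTM cell with randomly initialised, untrained weights."""
    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        cols = in_dim + hid_dim  # each gate acts on the concatenation [x; h]
        self.W = {gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                         for _ in range(hid_dim)] for gate in "ifgo"}
        self.b = {gate: [0.0] * hid_dim for gate in "ifgo"}

    def step(self, x, h, c):
        z = x + h  # list concatenation
        pre = {gate: [sum(w * v for w, v in zip(row, z)) + bias
                      for row, bias in zip(self.W[gate], self.b[gate])]
               for gate in "ifgo"}
        i = [sigmoid(v) for v in pre["i"]]    # input gate
        f = [sigmoid(v) for v in pre["f"]]    # forget gate
        g = [math.tanh(v) for v in pre["g"]]  # candidate cell state
        o = [sigmoid(v) for v in pre["o"]]    # output gate
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

def run_lstm(cell, seq):
    """Run the cell over a sequence and return the hidden state at every step."""
    h, c = [0.0] * cell.hid_dim, [0.0] * cell.hid_dim
    outputs = []
    for x in seq:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs

def bilstm(seq, fwd, bwd):
    """Concatenate forward and backward hidden states at each position."""
    f_out = run_lstm(fwd, seq)
    b_out = run_lstm(bwd, seq[::-1])[::-1]
    return [hf + hb for hf, hb in zip(f_out, b_out)]

# Stage 1: short-term clips with timestamps, fused over overlapping windows.
clips = split_into_clips(num_frames=96, fps=8, clip_len=16)
clip_feats = [encode_clip(c, dim=4) for c in clips]
windows = overlapping_windows(len(clips), window=3, stride=1)
local_feats = [mean_pool(clip_feats[s:e]) for s, e in windows]

# Stage 2: global aggregation of the local features with a bidirectional LSTM.
rng = random.Random(0)
global_feats = bilstm(local_feats, LSTMCell(4, 3, rng), LSTMCell(4, 3, rng))
```

With these assumed settings, the 96-frame input yields six clips, four overlapping windows, and one six-dimensional global feature per window (three forward plus three backward hidden units). In a real system both the clip encoder and the recurrent weights would be learned rather than fixed or random.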
