Temporal Relational Reasoning in Videos
Bolei Zhou, Alex Andonian, Aude Oliva, Antonio Torralba

TL;DR
This paper introduces the Temporal Relation Network (TRN), an interpretable module that enhances neural networks' ability to learn and reason about temporal dependencies in videos across multiple time scales, improving activity recognition.
Contribution
The paper presents the TRN module for temporal relational reasoning in videos and evaluates TRN-equipped networks on activity recognition using three video datasets: Something-Something, Jester, and Charades. It reports that TRN outperforms two-stream and 3D convolution networks.
Findings
TRN improves activity recognition accuracy in video datasets.
TRN enables models to learn interpretable visual common sense.
TRN outperforms two-stream and 3D convolution networks.
Abstract
Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species. In this paper, we introduce an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales. We evaluate TRN-equipped networks on activity recognition tasks using three recent video datasets - Something-Something, Jester, and Charades - which fundamentally depend on temporal relational reasoning. Our results demonstrate that the proposed TRN gives convolutional neural networks a remarkable capacity to discover temporal relations in videos. Through only sparsely sampled video frames, TRN-equipped networks can accurately predict human-object interactions in the Something-Something dataset and identify various…
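As a rough illustration of what reasoning over sparsely sampled frames at multiple time scales could look like, the NumPy sketch below scores temporally ordered 2-, 3-, and 4-frame tuples and sums the results across scales. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `init_trn` and `trn_scores`, the random weights standing in for learned parameters, the layer sizes, and the choice of scales are all hypothetical, and the frame features are random placeholders for per-frame CNN outputs.

```python
from itertools import combinations

import numpy as np


def init_trn(rng, feat_dim, scales, hidden_dim, num_classes):
    """Random weights per scale (hypothetical stand-ins for learned parameters)."""
    return {
        k: {
            # "g" maps a concatenated k-frame tuple to a hidden relation feature.
            "g": rng.standard_normal((k * feat_dim, hidden_dim)) * 0.1,
            # "h" maps the summed relation feature to class scores.
            "h": rng.standard_normal((hidden_dim, num_classes)) * 0.1,
        }
        for k in scales
    }


def trn_scores(frame_feats, params):
    """Sum class scores over all ordered k-frame tuples, for every scale k."""
    num_frames = frame_feats.shape[0]
    scores = 0.0
    for k, p in params.items():
        # All index tuples i1 < i2 < ... < ik, so temporal order is preserved.
        tuples = np.array(list(combinations(range(num_frames), k)))
        # Gather the frames of each tuple and concatenate their features.
        x = frame_feats[tuples].reshape(len(tuples), -1)
        # Per-tuple relation feature (ReLU layer), aggregated by summation.
        relation = np.maximum(x @ p["g"], 0.0).sum(axis=0)
        scores = scores + relation @ p["h"]
    return scores


rng = np.random.default_rng(0)
frame_feats = rng.standard_normal((8, 16))  # 8 sparsely sampled frames, 16-d each
params = init_trn(rng, feat_dim=16, scales=(2, 3, 4), hidden_dim=32, num_classes=5)
scores = trn_scores(frame_feats, params)
print(scores.shape)  # (5,)
```

Because each tuple keeps its frames in temporal order before concatenation, reversing the frame sequence generally changes the scores, which is the kind of order sensitivity that temporal relational reasoning requires.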
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: 3D Convolution · Convolution
