Bidirectional Multirate Reconstruction for Temporal Modeling in Videos
Linchao Zhu, Zhongwen Xu, Yi Yang

TL;DR
This paper introduces an unsupervised bidirectional multirate reconstruction method for temporal modeling in videos, effectively handling motion speed variations and improving performance in event detection and captioning tasks.
Contribution
It proposes a novel multirate visual recurrent model with bidirectional reconstruction for unsupervised temporal learning in videos, addressing motion speed variance.
Findings
Achieves 10.4% improvement in event detection on MEDTest-13
Sets new state-of-the-art in video captioning on YouTube2Text
Effective in modeling temporal information with untrimmed videos
Abstract
Despite the recent success of neural networks in image feature learning, a major problem in the video domain is the lack of sufficient labeled data for learning to model temporal information. In this paper, we propose an unsupervised temporal modeling method that learns from untrimmed videos. The speed of motion varies constantly, e.g., a man may run quickly or slowly. We therefore train a Multirate Visual Recurrent Model (MVRM) by encoding frames of a clip with different intervals. This learning process makes the learned model more capable of dealing with motion speed variance. Given a clip sampled from a video, we use its past and future neighboring clips as the temporal context, and reconstruct the two temporal transitions, i.e., present→past transition and present→future transition, reflecting the temporal information in different views. The proposed method…
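The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the clip length, the set of intervals, and the choice to clamp indices at video boundaries are all assumptions made for illustration. The sketch only computes which frame indices a clip would cover at each sampling interval, and which neighboring clips would serve as the past and future reconstruction targets; the recurrent encoder and decoders are omitted.

```python
def multirate_clip_indices(center, clip_len, rates, num_frames):
    """Frame indices of a clip centered at `center`, one list per sampling interval.

    Hypothetical helper: `rates` are the frame intervals (1 = every frame,
    2 = every second frame, ...). Indices are clamped to [0, num_frames - 1].
    """
    clips = {}
    for rate in rates:
        start = center - (clip_len // 2) * rate
        indices = [start + i * rate for i in range(clip_len)]
        clips[rate] = [min(max(j, 0), num_frames - 1) for j in indices]
    return clips


def temporal_context(center, clip_len, rate, num_frames):
    """Return (past, present, future) clips sampled at the same interval.

    Assumption: neighboring clips are adjacent and non-overlapping, so their
    centers are shifted by one clip span (clip_len * rate) in each direction.
    """
    span = clip_len * rate
    present = multirate_clip_indices(center, clip_len, [rate], num_frames)[rate]
    past = multirate_clip_indices(center - span, clip_len, [rate], num_frames)[rate]
    future = multirate_clip_indices(center + span, clip_len, [rate], num_frames)[rate]
    return past, present, future


if __name__ == "__main__":
    # The same moment in a 100-frame video, viewed at three motion speeds.
    for rate, frames in multirate_clip_indices(50, 5, [1, 2, 4], 100).items():
        print(rate, frames)
    # Reconstruction targets for the present clip at interval 2.
    print(temporal_context(50, 5, 2, 100))
```

In a model of this kind, the present clip at each interval would be encoded, and two decoders would be trained to reconstruct the past and future clips from that encoding. Larger intervals cover a longer time span with the same number of frames, which is how a single model can be exposed to both fast and slow motion.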
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
