Describing Videos by Exploiting Temporal Structure
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville

TL;DR
This paper introduces a novel video description method that leverages 3-D CNNs for capturing short-term dynamics and a temporal attention mechanism for selecting relevant segments, achieving state-of-the-art results on multiple datasets.
Contribution
It combines 3-D CNN representations with a temporal attention mechanism to improve video description quality, addressing both local and global temporal structures.
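To make the temporal attention idea concrete, here is a minimal sketch of soft attention over video segments, written in plain Python. It assumes a common additive scoring form, score_i = w · tanh(W_a h + U_a v_i + b_a), followed by a softmax over segments. The function name, the parameter names (`W_a`, `U_a`, `b_a`, `w`), the dimensions, and the random weights are illustrative assumptions, not the authors' released code.

```python
import math
import random


def temporal_attention(features, hidden, W_a, U_a, b_a, w):
    """Soft attention over time (illustrative sketch, not the paper's code).

    features: list of n per-segment feature vectors, each of length d
    hidden:   decoder state vector of length m
    W_a: k x m, U_a: k x d, b_a: length k, w: length k (hypothetical parameters)
    Returns (weights, context): n attention weights and a length-d context vector.
    """
    def matvec(M, x):
        return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

    h_proj = matvec(W_a, hidden)  # same decoder projection for every segment
    scores = []
    for v in features:
        v_proj = matvec(U_a, v)
        act = [math.tanh(a + b + c) for a, b, c in zip(h_proj, v_proj, b_a)]
        scores.append(sum(w_k * a_k for w_k, a_k in zip(w, act)))

    # Numerically stable softmax over the temporal axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the segment features.
    d = len(features[0])
    context = [sum(a * v[j] for a, v in zip(weights, features)) for j in range(d)]
    return weights, context


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy usage with made-up sizes: 5 segments, 4-d features, 3-d decoder state.
rng = random.Random(0)
n, d, m, k = 5, 4, 3, 6
features = random_matrix(n, d, rng)
hidden = [rng.uniform(-1, 1) for _ in range(m)]
W_a = random_matrix(k, m, rng)
U_a = random_matrix(k, d, rng)
b_a = [0.0] * k
w = [rng.uniform(-0.1, 0.1) for _ in range(k)]

weights, context = temporal_attention(features, hidden, W_a, U_a, b_a, w)
```

In a full captioning model, the weights would be recomputed at every decoding step from the current decoder state, so that each generated word can draw on a different part of the video.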
Findings
Outperforms previous methods on BLEU and METEOR metrics.
Effective in selecting relevant temporal segments for description.
Demonstrates scalability on larger datasets.
Abstract
Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application for video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and then properly integrating that information into a natural language description. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatial temporal 3-D convolutional neural network (3-D CNN) representation of the short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks, so as to produce a representation that is tuned to human motion and behavior. Second, we propose a temporal attention mechanism that allows going beyond local temporal modeling and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
