Describing Videos by Exploiting Temporal Structure

Li Yao; Atousa Torabi; Kyunghyun Cho; Nicolas Ballas; Christopher Pal; Hugo Larochelle; Aaron Courville

arXiv:1502.08029·stat.ML·October 2, 2015·189 cites

TL;DR

This paper introduces a novel video description method that leverages 3-D CNNs for capturing short-term dynamics and a temporal attention mechanism for selecting relevant segments, achieving state-of-the-art results on multiple datasets.

Contribution

It combines 3-D CNN representations with a temporal attention mechanism to improve video description quality, addressing both local and global temporal structures.
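
To make the temporal attention idea concrete, here is a minimal sketch in plain Python. It assumes an additive (tanh) scoring function that compares each segment feature with the current decoder state, followed by a softmax over segments; the listing above does not spell out the scoring function, so treat this form as an assumption rather than the authors' exact implementation. The names `temporal_attention`, `W`, `U`, `b`, `w`, and `demo` are hypothetical, and random vectors stand in for real 3-D CNN segment features.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(features, hidden, W, U, b, w):
    """Weight video segments by relevance to the current decoder state.

    features: n segment feature vectors, each of length d_v
              (stand-ins for per-segment 3-D CNN features)
    hidden:   decoder state of length d_h
    W (d_a x d_h), U (d_a x d_v), b (d_a), w (d_a): hypothetical
              attention parameters; the additive form is an assumption.
    Returns (weights, context): one weight per segment summing to 1,
    and the weighted sum of segment features.
    """
    projected_state = matvec(W, hidden)
    scores = []
    for segment in features:
        projected_segment = matvec(U, segment)
        activation = [
            math.tanh(s + v + bias)
            for s, v, bias in zip(projected_state, projected_segment, b)
        ]
        scores.append(sum(wi * ai for wi, ai in zip(w, activation)))
    weights = softmax(scores)
    d_v = len(features[0])
    context = [
        sum(alpha * segment[j] for alpha, segment in zip(weights, features))
        for j in range(d_v)
    ]
    return weights, context


def demo(seed=0):
    """Run the sketch on random stand-in data with a fixed seed."""
    rng = random.Random(seed)
    n, d_v, d_h, d_a = 5, 4, 3, 6

    def rand_matrix(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    features = rand_matrix(n, d_v)
    hidden = [rng.uniform(-1, 1) for _ in range(d_h)]
    W, U = rand_matrix(d_a, d_h), rand_matrix(d_a, d_v)
    b = [0.0] * d_a
    w = [rng.uniform(-1, 1) for _ in range(d_a)]
    return temporal_attention(features, hidden, W, U, b, w)


if __name__ == "__main__":
    weights, context = demo()
    print("attention weights:", [round(a, 3) for a in weights])
    print("context vector:", [round(c, 3) for c in context])
```

In a full description model, the decoder would recompute the weights at every generated word, so different words can draw on different parts of the video. In this sketch, setting `w` to all zeros gives every segment the same score, and the context vector reduces to a plain average of the segment features.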

Findings

01 Outperforms previous methods on BLEU and METEOR metrics.

02 Effective in selecting relevant temporal segments for description.

03 Demonstrates scalability on larger datasets.

Abstract

Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application for video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and then properly integrating that information into a natural language description. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of the short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks, so as to produce a representation that is tuned to human motion and behavior. Second, we propose a temporal attention mechanism that allows the model to go beyond local temporal modeling and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization