Sequence to Sequence -- Video to Text

Subhashini Venugopalan; Marcus Rohrbach; Jeff Donahue; Raymond Mooney; Trevor Darrell; Kate Saenko

arXiv:1505.00487 · cs.CV · October 20, 2015 · 193 cites

TL;DR

This paper introduces an end-to-end sequence-to-sequence LSTM model for video description. Trained on video-sentence pairs, it maps a variable-length sequence of frames to a variable-length sequence of words while modeling the temporal structure of the video.

Contribution

A single LSTM model learns both the temporal structure of the frame sequence and a language model over the generated sentence, advancing open-domain video description.

Findings

01 Model achieves state-of-the-art performance on YouTube videos

02 Exploits different visual features for improved accuracy

03 Successfully generalizes across multiple datasets

Abstract

Real-world videos often have complex dynamics, and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model is naturally able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model. We evaluate several variants of our…
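The encode-then-decode idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single LSTM cell in plain Python, random untrained weights, a toy vocabulary, and an input layout (frame features concatenated with a one-hot word, with the unused half zeroed) chosen only for illustration. All names (`TinyLSTM`, `describe`, `VOCAB`, `W_out`) are hypothetical. It shows only how one recurrent model can read any number of frames and then emit a word sequence of variable length.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class TinyLSTM:
    """One LSTM cell: sigmoid input/forget/output gates and a tanh candidate."""
    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One weight matrix per gate, applied to the concatenation [x, h].
        self.W = {g: rand_matrix(hidden_size, input_size + hidden_size, rng)
                  for g in "ifog"}

    def step(self, x, h, c):
        xh = x + h  # list concatenation
        i = [sigmoid(a) for a in matvec(self.W["i"], xh)]
        f = [sigmoid(a) for a in matvec(self.W["f"], xh)]
        o = [sigmoid(a) for a in matvec(self.W["o"], xh)]
        g = [math.tanh(a) for a in matvec(self.W["g"], xh)]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c

# Toy vocabulary (assumption, for illustration only).
VOCAB = ["<bos>", "<eos>", "a", "man", "is", "cooking"]

def describe(frames, lstm, W_out, max_words=8):
    h = [0.0] * lstm.hidden_size
    c = [0.0] * lstm.hidden_size
    no_word = [0.0] * len(VOCAB)
    no_frame = [0.0] * len(frames[0])
    # Encoding: read every frame, however many there are.
    for frame in frames:
        h, c = lstm.step(frame + no_word, h, c)
    # Decoding: feed back the previous word until <eos> or the length cap.
    words, prev = [], VOCAB.index("<bos>")
    for _ in range(max_words):
        one_hot = [1.0 if k == prev else 0.0 for k in range(len(VOCAB))]
        h, c = lstm.step(no_frame + one_hot, h, c)
        scores = matvec(W_out, h)
        prev = max(range(len(VOCAB)), key=scores.__getitem__)  # greedy choice
        if VOCAB[prev] == "<eos>":
            break
        words.append(VOCAB[prev])
    return words

rng = random.Random(0)
FEAT, HIDDEN = 4, 5  # hypothetical feature and hidden sizes
lstm = TinyLSTM(FEAT + len(VOCAB), HIDDEN, rng)
W_out = rand_matrix(len(VOCAB), HIDDEN, rng)
short_clip = [[rng.random() for _ in range(FEAT)] for _ in range(3)]
long_clip = [[rng.random() for _ in range(FEAT)] for _ in range(10)]
caption_short = describe(short_clip, lstm, W_out)
caption_long = describe(long_clip, lstm, W_out)
```

Because the weights are untrained, the emitted words are meaningless. The point of the sketch is that the same `describe` call accepts clips of 3 or 10 frames and stops on its own when `<eos>` is chosen, which is the variable-length behavior the abstract describes.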

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory