Sequence to Sequence -- Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

TL;DR
This paper introduces an end-to-end sequence-to-sequence LSTM model that generates video descriptions. The model captures the temporal dynamics of the frames, handles variable-length input and output sequences, and is trained on video-sentence pairs.
Contribution
The model learns both the temporal structure of the frame sequence and a language model for the generated sentences, advancing open-domain video description.
Findings
Model achieves state-of-the-art performance on YouTube videos
Exploits different visual features for improved accuracy
Successfully generalizes across multiple datasets
Abstract
Real-world videos often have complex dynamics; and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model naturally is able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model. We evaluate several variants of our…
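The encode-then-decode idea in the abstract (read a variable number of frames, then emit a variable number of words) can be illustrated with a minimal sketch. This is a generic, untrained LSTM encoder-decoder in plain Python, not the authors' architecture or code. All names (`make_lstm`, `lstm_step`, `caption`), the sizes, the toy vocabulary, and the random "frame features" are assumptions made for illustration; biases and training are omitted.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(input_size, hidden_size, rng):
    """Random weights for the input, forget, output and candidate gates (no biases)."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
    return {gate: mat() for gate in "ifog"}

def lstm_step(params, x, h, c):
    """One LSTM step: sigmoid gates, tanh candidate, updated hidden and cell state."""
    z = x + h  # concatenate the input vector with the previous hidden state
    def affine(W):
        return [sum(w * v for w, v in zip(row, z)) for row in W]
    i = [sigmoid(a) for a in affine(params["i"])]
    f = [sigmoid(a) for a in affine(params["f"])]
    o = [sigmoid(a) for a in affine(params["o"])]
    g = [math.tanh(a) for a in affine(params["g"])]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

def caption(frames, enc, dec, embed, out_W, vocab, max_len=10):
    """Encode all frames, then greedily decode words until <eos> or max_len."""
    hidden = len(out_W[0])
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in frames:  # encoding stage: accepts any number of frames
        h, c = lstm_step(enc, x, h, c)
    words, token = [], vocab.index("<bos>")
    for _ in range(max_len):  # decoding stage: output length is not fixed
        h, c = lstm_step(dec, embed[token], h, c)
        scores = [sum(w * v for w, v in zip(row, h)) for row in out_W]
        token = max(range(len(vocab)), key=scores.__getitem__)
        if vocab[token] == "<eos>":
            break
        words.append(vocab[token])
    return words

# Toy setup (hypothetical sizes and vocabulary); weights are random, so the
# output is not a meaningful caption, only a demonstration of the data flow.
rng = random.Random(0)
feat_dim, hidden, emb_dim = 8, 6, 5
vocab = ["<bos>", "<eos>", "a", "man", "is", "cooking"]
enc = make_lstm(feat_dim, hidden, rng)
dec = make_lstm(emb_dim, hidden, rng)
embed = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in vocab]
out_W = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in vocab]
frames = [[rng.random() for _ in range(feat_dim)] for _ in range(12)]
print(caption(frames, enc, dec, embed, out_W, vocab))
```

In a trained system, the frame vectors would come from a visual feature extractor and the weights would be learned from video-sentence pairs. Here they are random stand-ins so the sketch stays self-contained.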
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
