The Long-Short Story of Movie Description
Anna Rohrbach, Marcus Rohrbach, Bernt Schiele

TL;DR
This paper advances movie description generation by developing robust visual classifiers from weak annotations and optimizing LSTM-based models, achieving state-of-the-art results on the MPII-MD dataset.
Contribution
It introduces a method to learn visual classifiers from weak sentence annotations and demonstrates improved LSTM-based description generation for movies.
Findings
Achieved the best reported performance on the MPII-MD dataset at the time of publication.
Analyzed key challenges in the movie description task.
Compared various design choices for LSTM training.
Abstract
Generating descriptions for videos has many applications including assisting blind people and human-robot interaction. The recent advances in image captioning as well as the release of large-scale movie description datasets such as MPII Movie Description allow this task to be studied in more depth. Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and Long Short-Term Memory recurrent networks (LSTMs) for generating descriptions. While image description focuses on objects, we argue that it is important to distinguish verbs, objects, and places in the challenging setting of movie description. In this work we show how to learn robust visual classifiers from the weak annotations of the sentence descriptions. Based on these visual classifiers we learn how to generate a description using an LSTM. We explore different design choices to build and train the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
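
The three methods listed above fit together in a single LSTM update: sigmoid activations drive the input, forget, and output gates, while tanh shapes the candidate value and the emitted hidden state. The following is a minimal sketch of the standard textbook LSTM step in plain Python, not the paper's implementation. The function name `lstm_step`, the `params` layout, and the use of list-valued vectors are assumptions made for illustration; in a description-generation setting the input `x` would stand in for whatever features (for example visual classifier scores and the previous word) are fed to the network.

```python
import math


def sigmoid(x):
    """Logistic function, squashes a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell on list-valued vectors.

    params (hypothetical layout) maps each gate name in "ifog" to a
    tuple (W, U, b): W is hidden x input, U is hidden x hidden, and
    b has one bias per hidden unit.
    """
    def affine(W, U, b):
        # Computes W @ x + U @ h_prev + b, one hidden unit at a time.
        return [
            sum(w * xi for w, xi in zip(W[k], x))
            + sum(u * hi for u, hi in zip(U[k], h_prev))
            + b[k]
            for k in range(len(b))
        ]

    i = [sigmoid(v) for v in affine(*params["i"])]    # input gate
    f = [sigmoid(v) for v in affine(*params["f"])]    # forget gate
    o = [sigmoid(v) for v in affine(*params["o"])]    # output gate
    g = [math.tanh(v) for v in affine(*params["g"])]  # candidate cell value

    # New cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed view of the cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step; this makes a convenient sanity check before plugging in learned parameters.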
