A Dataset for Movie Description
Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele

TL;DR
This paper introduces a new dataset of transcribed descriptive video service (DVS) aligned with full-length HD movies, enabling research in video description generation and comparison with movie scripts.
Contribution
The work presents a novel, large-scale dataset of DVS and scripts aligned with movies, and benchmarks different approaches for generating video descriptions.
Findings
Compared with movie scripts, DVS is more visual and describes on-screen content more precisely.
The dataset contains over 54,000 sentences from 72 movies.
DVS descriptions differ significantly from pre-written scripts.
Abstract
Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS, which is temporally aligned to full length HD movies. In addition we also collected the aligned movie scripts which have been used in prior work and compare the two different sources of descriptions. In total the Movie Description dataset contains a parallel corpus of over 54,000 sentences and video snippets from 72 HD movies. We characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing DVS to scripts, we find that DVS is far more visual and describes precisely what is…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
