Learning Language-Visual Embedding for Movie Understanding with Natural-Language
Atousa Torabi, Niket Tandon, Leonid Sigal

TL;DR
This paper develops and evaluates three joint language-visual neural network models for movie understanding, demonstrating their effectiveness on large-scale datasets through tasks like video annotation, retrieval, and a novel multiple-choice test.
Contribution
It introduces three new neural network architectures for joint language-visual embedding and a novel multiple-choice evaluation method for movie understanding.
Findings
Best model achieves 19.2% Recall@10 on annotation task
Best model achieves 18.9% Recall@10 on retrieval task
Model attains 58.11% accuracy on the multiple-choice test
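The Recall@10 figures above are ranking metrics. As a rough illustration of how such a number is typically computed (not the authors' evaluation code), the sketch below assumes a square score matrix in which entry `[i][j]` is the model's similarity between query `i` and candidate `j`, and the correct candidate for query `i` sits at index `i`. The function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    Assumption: similarity[i][j] scores query i against candidate j,
    and the ground-truth candidate for query i is at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank = number of candidates scoring strictly higher
        # than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 score matrix (made-up values, for illustration only).
scores = [
    [0.9, 0.1, 0.3],  # correct candidate ranked first
    [0.8, 0.2, 0.5],  # correct candidate ranked last
    [0.1, 0.2, 0.7],  # correct candidate ranked first
]

r1 = recall_at_k(scores, 1)
r3 = recall_at_k(scores, 3)
```

For annotation, the queries would be videos and the candidates captions; for retrieval, the roles swap, which in this setup amounts to transposing the matrix.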
Abstract
Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural language video annotation based on human activities. In addition to the original Audio Description (AD) captions, provided as part of LSMDC16, we collected and will make available a) manually generated re-phrasings of those captions obtained using Amazon MTurk and b) automatically generated human activity elements in "Predicate + Object" (PO) phrases…
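To make the multiple-choice test concrete, here is a minimal sketch of how a joint embedding could be scored on it. The abstract does not specify the scoring rule or the number of choices, so everything here is an assumption: the video and each candidate caption are taken to be already embedded in a shared space, the model answers with the caption of highest cosine similarity, and accuracy is the fraction of questions answered correctly. All names and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_question(video_vec, choice_vecs):
    """Return the index of the caption embedding closest to the video."""
    sims = [cosine(video_vec, c) for c in choice_vecs]
    return max(range(len(sims)), key=sims.__getitem__)


def multiple_choice_accuracy(questions):
    """questions: list of (video_vec, choice_vecs, correct_index) tuples."""
    correct = sum(
        1
        for video_vec, choice_vecs, answer in questions
        if answer_question(video_vec, choice_vecs) == answer
    )
    return correct / len(questions)


# Toy 2-D embeddings (made-up values, for illustration only).
toy_questions = [
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]], 0),   # answered correctly
    ([0.0, 1.0], [[1.0, 0.0], [0.7, 0.7], [0.1, 0.9]], 2),    # answered correctly
    ([1.0, 1.0], [[1.0, 0.0], [-1.0, -1.0], [0.0, -1.0]], 1),  # answered wrongly
]

accuracy = multiple_choice_accuracy(toy_questions)
```

Because each question has exactly one correct caption, this kind of test can be scored automatically, which is the property the abstract highlights.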
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
