Learning Language-Visual Embedding for Movie Understanding with Natural-Language

Atousa Torabi; Niket Tandon; Leonid Sigal

arXiv:1609.08124 · cs.CV · September 27, 2016 · 72 cites

TL;DR

This paper develops and evaluates three joint language-visual neural network models for movie understanding, demonstrating their effectiveness on large-scale datasets through tasks like video annotation, retrieval, and a novel multiple-choice test.

Contribution

It introduces three new neural network architectures for joint language-visual embedding and a novel multiple-choice evaluation method for movie understanding.

Findings

01

Best model achieves 19.2% Recall@10 on annotation task

02

Best model achieves 18.9% Recall@10 on retrieval task

03

Model attains 58.11% accuracy on the multiple-choice test

Abstract

Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural language video annotation based on human activities. In addition to the original Audio Description (AD) captions, provided as part of LSMDC16, we collected and will make available a) manually generated re-phrasings of those captions obtained using Amazon MTurk, and b) automatically generated human activity elements in "Predicate + Object" (PO) phrases…
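
The two evaluation tasks named in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code: it assumes a square similarity matrix in which entry (i, j) scores video i against caption j and the correct pair shares the same index, and it assumes a multiple-choice item is answered by picking the highest-scoring candidate caption. The function names `recall_at_k` and `multiple_choice_accuracy`, the toy scores, and the number of candidates per question are all hypothetical.

```python
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries (rows) whose matching item ranks in the top k.

    Assumption: the correct item for query i is column i.
    """
    n = similarity.shape[0]
    idx = np.arange(n)
    # Score the model gave to the correct item of each query.
    correct = similarity[idx, idx]
    # Zero-based rank: how many items scored strictly higher than the correct one.
    ranks = (similarity > correct[:, None]).sum(axis=1)
    return float((ranks < k).mean())


def multiple_choice_accuracy(scores, answers) -> float:
    """Fraction of questions where the top-scoring candidate is the right one.

    scores[q, c] is the model score of candidate caption c for question q;
    answers[q] is the index of the correct candidate.
    """
    scores = np.asarray(scores)
    answers = np.asarray(answers)
    return float((scores.argmax(axis=1) == answers).mean())


# Toy similarity matrix: rows are videos, columns are captions (made-up values).
sim = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
])

# Annotation: query with a video, rank the captions.
annotation_r1 = recall_at_k(sim, k=1)
# Retrieval: query with a caption, rank the videos (transpose the matrix).
retrieval_r1 = recall_at_k(sim.T, k=1)

# Two toy multiple-choice questions with three candidates each.
mc_acc = multiple_choice_accuracy([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]], [1, 2])

print(annotation_r1, retrieval_r1, mc_acc)
```

Under these assumptions, the reported Recall@10 figures would correspond to `recall_at_k(..., k=10)` over the test set in each direction, and the multiple-choice accuracy to the share of questions where the correct caption receives the highest score.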

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition