Jointly Modeling Embedding and Translation to Bridge Video and Language

Yingwei Pan; Tao Mei; Ting Yao; Houqiang Li; Yong Rui

arXiv:1505.01861 · cs.CV · June 5, 2015 · 29 cites

TL;DR

This paper introduces LSTM-E, a unified framework that jointly models visual content and sentence semantics to improve automatic video description, achieving state-of-the-art results on the YouTube2Text dataset.

Contribution

The paper proposes a novel joint embedding and LSTM framework that simultaneously learns visual-semantic relationships and sequence generation for video captioning.
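One way to picture a joint objective of this kind is as a weighted sum of two terms: a relevance term that pulls a video and its sentence together in a shared embedding space, and a coherence term that scores the words generated one at a time. The sketch below is illustrative only and is not taken from the paper's text as shown here. The squared-distance relevance term, the negative log-likelihood coherence term, the trade-off weight `lam`, and all function and variable names (`relevance_loss`, `coherence_loss`, `joint_loss`, `T_v`, `T_s`) are assumptions.

```python
import numpy as np


def relevance_loss(video_feat, sent_feat, T_v, T_s):
    """Squared distance between video and sentence after projecting both
    into a shared embedding space (assumed form of the relevance term)."""
    diff = T_v @ video_feat - T_s @ sent_feat
    return float(diff @ diff)


def coherence_loss(word_probs):
    """Negative log-likelihood of the reference sentence.

    word_probs[t] is the probability the model assigned to the correct
    word at step t, given the previous words and the visual content.
    """
    return float(-np.sum(np.log(word_probs)))


def joint_loss(video_feat, sent_feat, T_v, T_s, word_probs, lam=0.5):
    """Weighted sum of the two terms; lam is a hypothetical trade-off weight."""
    rel = relevance_loss(video_feat, sent_feat, T_v, T_s)
    coh = coherence_loss(word_probs)
    return (1.0 - lam) * rel + lam * coh


if __name__ == "__main__":
    # Toy inputs: 2-d features and identity projections, chosen by hand.
    video = np.array([1.0, 0.0])
    sentence = np.array([0.0, 0.0])
    eye = np.eye(2)
    probs = np.array([0.5, 0.5])  # two words, each predicted with p = 0.5
    print(joint_loss(video, sentence, eye, eye, probs, lam=0.5))
```

In a real system the projections and the word probabilities would come from trained networks and the whole objective would be minimized by gradient descent; here they are fixed arrays so that the arithmetic of the combined objective is easy to follow.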

Findings

01 Achieves 45.3% BLEU@4 and 31.0% METEOR on YouTube2Text.

02 Outperforms existing methods in subject-verb-object (SVO) triplet prediction.

03 Demonstrates the effectiveness of joint embedding in video captioning.

Abstract

Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention in visual interpretation. However, most existing approaches generate each word locally, given the previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct, but their semantics (e.g., subjects, verbs, or objects) may be wrong. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given previous words and visual content, while…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory