Jointly Modeling Embedding and Translation to Bridge Video and Language
Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui

TL;DR
This paper introduces LSTM-E, a unified framework that jointly models visual content and sentence semantics to improve automatic video description, achieving state-of-the-art results on the YouTube2Text dataset.
Contribution
The paper proposes a novel joint embedding and LSTM framework that simultaneously learns visual-semantic relationships and sequence generation for video captioning.
Findings
Achieves 45.3% BLEU@4 and 31.0% METEOR on YouTube2Text.
Outperforms existing methods in SVO triplet prediction.
Demonstrates the effectiveness of joint embedding in video captioning.
Abstract
Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention in visual interpretation. However, most existing approaches generate each word locally, given the previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but semantically wrong (e.g., in their subjects, verbs, or objects). This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given previous words and visual content, while…
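The joint training idea can be illustrated with a small sketch. The abstract is cut off before it describes the embedding objective, so everything specific here is an assumption made for illustration and not the authors' formulation: the squared-distance relevance term, the trade-off weight `lam`, and the function names `relevance_loss`, `coherence_loss`, and `joint_loss` are all hypothetical. Only the next-word log-likelihood term follows directly from the text above.

```python
import math

def relevance_loss(video_emb, sentence_emb):
    # Assumed form: squared Euclidean distance between the video and
    # sentence vectors in a shared embedding space.
    return sum((v - s) ** 2 for v, s in zip(video_emb, sentence_emb))

def coherence_loss(word_probs):
    # Negative log-likelihood of the sentence, where word_probs[t] is the
    # model's probability of the correct word at step t given the previous
    # words and the visual content.
    return -sum(math.log(p) for p in word_probs)

def joint_loss(video_emb, sentence_emb, word_probs, lam=0.5):
    # Hypothetical weighted sum of the two terms; lam in [0, 1] trades off
    # embedding relevance against sentence coherence.
    relevance = relevance_loss(video_emb, sentence_emb)
    coherence = coherence_loss(word_probs)
    return (1.0 - lam) * relevance + lam * coherence

# Toy inputs (made up, not from the paper).
video_emb = [0.2, 0.8, -0.1]
sentence_emb = [0.1, 0.9, 0.0]
word_probs = [0.5, 0.25, 0.5]

print(f"joint loss: {joint_loss(video_emb, sentence_emb, word_probs, lam=0.7):.4f}")
```

Setting `lam=1.0` in this sketch reduces it to a purely local next-word objective, the setup the abstract attributes to most existing approaches; lower values give more weight to agreement between the video and the sentence as a whole.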
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
