Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation

Anirudh Goyal; Marius Leordeanu

arXiv:1511.06674 · cs.CV · November 23, 2015 · 1 citation

TL;DR

This paper introduces a hierarchical, vision-only method for translating videos into natural language, effectively capturing visual stories without relying on pre-established linguistic models, and achieves state-of-the-art results.

Contribution

It presents a novel hierarchical approach that learns visual story representations directly from data for video-to-language translation, bypassing traditional linguistic models.

Findings

01 Achieves state-of-the-art performance on the YouTube video dataset.

02 Outperforms approaches that use pre-learned linguistic knowledge.

03 Effectively captures visual stories using only visual cues.

Abstract

Integrating higher level visual and linguistic interpretations is at the heart of human intelligence. As automatic visual category recognition in images is approaching human performance, the high level understanding in the dynamic spatiotemporal domain of videos and its translation into natural language is still far from being solved. While most works on vision-to-text translations use pre-learned or pre-established computational linguistic models, in this paper we present an approach that uses vision alone to efficiently learn how to translate into language the video content. We discover, in simple form, the story played by main actors, while using only visual cues for representing objects and their interactions. Our method learns in a hierarchical manner higher level representations for recognizing subjects, actions and objects involved, their relevant contextual background and their…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning