Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation
Anirudh Goyal, Marius Leordeanu

TL;DR
This paper introduces a hierarchical, vision-only method for translating videos into natural language, effectively capturing visual stories without relying on pre-established linguistic models, and achieves state-of-the-art results.
Contribution
It presents a novel hierarchical approach that learns visual story representations directly from data for video-to-language translation, bypassing traditional linguistic models.
Findings
Achieves state-of-the-art performance on a YouTube video dataset.
Outperforms approaches using pre-learned linguistic knowledge.
Effectively captures visual stories using only visual cues.
Abstract
Integrating higher-level visual and linguistic interpretations is at the heart of human intelligence. While automatic visual category recognition in images is approaching human performance, high-level understanding in the dynamic spatiotemporal domain of videos, and its translation into natural language, is still far from solved. Although most work on vision-to-text translation uses pre-learned or pre-established computational linguistic models, in this paper we present an approach that uses vision alone to efficiently learn how to translate video content into language. We discover, in simple form, the story played by the main actors, while using only visual cues for representing objects and their interactions. Our method learns, in a hierarchical manner, higher-level representations for recognizing the subjects, actions and objects involved, their relevant contextual background and their…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
