Mining for meaning: from vision to language through multiple networks consensus

Iulia Duta; Andrei Liviu Nicolicioiu; Simion-Vlad Bogolin; Marius Leordeanu

arXiv:1806.01954 · cs.CV · May 26, 2020 · 1 cites

Open Access

TL;DR

This paper introduces a multi-network consensus approach for translating videos into natural language, leveraging diverse features and models to improve semantic accuracy and achieve state-of-the-art results on MSR-VTT.

Contribution

It proposes a novel consensus-based method using multiple encoder-decoder networks and diverse features for video captioning, enhancing semantic understanding.

Findings

01 Achieved state-of-the-art results on the MSR-VTT dataset.

02 Demonstrated the effectiveness of consensus among multiple models.

03 Improved semantic accuracy in video descriptions.

Abstract

Describing visual data in natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is about both content at the highest semantic level and fluent form. Here we propose an approach to describe videos in natural language by reaching a consensus among multiple encoder-decoder networks. Finding such a consensual linguistic description, which shares common properties with a larger group, has a better chance to convey the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video and the best one is chosen through an efficient, two-phase…
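
The idea of picking a description that "shares common properties with a larger group" can be illustrated with a small sketch. This is not the paper's selection procedure (the abstract is cut off before describing it); it is a minimal illustration that assumes a simple word-overlap similarity. The names `token_overlap` and `consensus_caption` are hypothetical, and the candidate captions are made up for the example.

```python
from typing import List


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two captions.

    Assumption: a stand-in similarity chosen for illustration only.
    """
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def consensus_caption(captions: List[str]) -> str:
    """Return the caption with the highest mean similarity to all the others."""
    if not captions:
        raise ValueError("need at least one caption")
    if len(captions) == 1:
        return captions[0]

    def mean_agreement(i: int) -> float:
        # Average similarity of caption i to every other candidate.
        others = [c for j, c in enumerate(captions) if j != i]
        return sum(token_overlap(captions[i], c) for c in others) / len(others)

    best = max(range(len(captions)), key=mean_agreement)
    return captions[best]


# Hypothetical outputs of four independent captioning models for one video.
candidates = [
    "a man is playing a guitar",
    "a man is playing guitar on stage",
    "a person plays a guitar",
    "a dog runs in a field",
]
chosen = consensus_caption(candidates)
print(chosen)
```

In this toy setup the outlier caption about a dog receives the lowest agreement score, so it is never selected; a caption that overlaps with most of the group wins.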

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization