Mining for meaning: from vision to language through multiple networks consensus
Iulia Duta, Andrei Liviu Nicolicioiu, Simion-Vlad Bogolin, Marius Leordeanu

TL;DR
This paper introduces a multi-network consensus approach for translating videos into natural language, leveraging diverse features and models to improve semantic accuracy and achieve state-of-the-art results on MSR-VTT.
Contribution
It proposes a novel consensus-based method using multiple encoder-decoder networks and diverse features for video captioning, enhancing semantic understanding.
Findings
Achieved state-of-the-art results on the MSR-VTT dataset.
Demonstrated the effectiveness of consensus among multiple models.
Improved semantic accuracy in video descriptions.
Abstract
Describing visual data in natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is about content at the highest semantic level as well as about fluent form. Here we propose an approach to describe videos in natural language by reaching a consensus among multiple encoder-decoder networks. Such a consensual linguistic description, which shares common properties with a larger group, has a better chance of conveying the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video and the best one is chosen through an efficient, two-phase…
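The abstract is cut off before it describes the two-phase selection, so the sketch below does not reproduce the authors' procedure. It only illustrates the general idea of consensus: given one candidate caption per model, keep the caption that agrees most with the rest. The function names (`token_overlap`, `consensus_caption`) and the use of word-set Jaccard similarity as the agreement measure are assumptions made for illustration.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two sentences.

    Assumption: a simple stand-in for whatever sentence similarity
    a real system would use.
    """
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def consensus_caption(candidates: list[str]) -> str:
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate caption")
    if len(candidates) == 1:
        return candidates[0]

    def mean_similarity(i: int) -> float:
        # Average agreement of candidate i with every other candidate.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(token_overlap(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=mean_similarity)
    return candidates[best]


# Hypothetical captions, one per model, for the same video.
candidates = [
    "a man is playing a guitar",
    "a man plays a guitar on stage",
    "a person is playing guitar",
    "a dog runs in a field",
]
print(consensus_caption(candidates))
```

In this toy example the outlier caption about a dog shares almost no words with the others, so it scores lowest and is never selected; the caption closest to the group as a whole is returned.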
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
