TL;DR
This paper introduces a neural network approach to create joint audio-visual embeddings from large-scale video datasets, enabling cross-modal retrieval of audio and video content in an unsupervised manner.
Contribution
It presents a novel method for learning cross-modal embeddings from YouTube-8M videos, facilitating audio-visual retrieval without supervision.
Findings
Achieved promising Recall@K results on a subset of YouTube-8M videos
Demonstrated effective cross-modal retrieval between audio and visual data
Validated the potential of unsupervised joint embedding learning
Abstract
The increasing amount of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we are able to create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit a given silent video well, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings…
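To make the Recall@K metric concrete, here is a minimal sketch of how cross-modal retrieval could be scored once audio and video have been projected into a shared space. It is not the paper's evaluation code. It assumes that row i of each embedding matrix comes from the same video, and that candidates are ranked by cosine similarity. The function names `l2_normalize` and `recall_at_k` and the toy data are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def recall_at_k(query_emb, target_emb, k):
    """Fraction of queries whose paired target (same row index) ranks in the top k.

    Assumption: cosine similarity is the ranking score; the paper may use
    a different distance.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    t = l2_normalize(np.asarray(target_emb, dtype=float))
    sims = q @ t.T                        # (num_queries, num_targets)
    true_sim = np.diag(sims)[:, None]     # similarity to the correct match
    # Rank of the correct match = number of targets scoring strictly higher.
    ranks = (sims > true_sim).sum(axis=1)
    return float((ranks < k).mean())


# Toy example (made-up embeddings): videos 0 and 1 align with their own
# audio, while the audio embeddings of videos 2 and 3 are swapped.
video_emb = np.eye(4)
audio_emb = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

video_to_audio_r1 = recall_at_k(video_emb, audio_emb, k=1)
video_to_audio_r2 = recall_at_k(video_emb, audio_emb, k=2)
```

Swapping the two arguments scores the opposite direction (audio query, visual target), which is how both retrieval tasks described in the abstract could be evaluated with the same function.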
