Cross-modal Embeddings for Video and Audio Retrieval

Didac Surís; Amanda Duarte; Amaia Salvador; Jordi Torres; Xavier Giró-i-Nieto

arXiv:1801.02200 · cs.IR · January 9, 2018

TL;DR

This paper introduces a neural network approach to create joint audio-visual embeddings from large-scale video datasets, enabling cross-modal retrieval of audio and video content in an unsupervised manner.

Contribution

It presents a novel method for learning cross-modal embeddings from YouTube-8M videos, facilitating audio-visual retrieval without supervision.

Findings

1. Achieved promising Recall@K results on a subset of YouTube-8M

2. Demonstrated effective cross-modal retrieval between audio and visual data

3. Validated the potential of unsupervised joint embedding learning

Abstract

The increasing number of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we are able to create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit a given silent video well, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings…
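
The Recall@K metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the i-th video embedding and the i-th audio embedding come from the same clip, and that candidates are ranked by cosine similarity in the joint space; the function names and toy vectors are hypothetical and are not taken from the paper or its repository.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(query_embs, target_embs, k):
    """Fraction of queries whose paired target (same index) is in the top k.

    Assumption: query i and target i belong to the same clip.
    """
    hits = 0
    for i, query in enumerate(query_embs):
        sims = [cosine_similarity(query, target) for target in target_embs]
        # Target indices ordered from most to least similar to the query.
        ranked = sorted(range(len(target_embs)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)

# Toy joint-space embeddings (made up for illustration only).
video_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_embs = [[0.9, 0.1], [0.6, 0.8], [0.2, 1.0]]

for k in (1, 2, 3):
    print(f"video -> audio Recall@{k}: {recall_at_k(video_embs, audio_embs, k):.3f}")
```

Swapping the two argument lists gives the audio-to-video direction, so the same helper covers both retrieval tasks described above.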

Code & Models

Repositories

surisdi/youtube-8m (tf, official)
