Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos

Kranti Kumar Parida; Neeraj Matiyali; Tanaya Guha; Gaurav Sharma

arXiv:1910.08732·cs.CV·October 22, 2019

TL;DR

This paper introduces a multimodal approach combining audio and visual data for zero-shot video classification and retrieval, demonstrating improved performance and proposing a novel modality attention mechanism.

Contribution

It develops a joint multimodal embedding framework for zero-shot video tasks and introduces a semi-supervised modality attention network that generalizes to unseen classes.

Findings

01

Adding audio improves zero-shot classification and retrieval performance.

02

The modality attention network effectively predicts dominant modalities without extra labels.

03

The authors construct a new large-scale multimodal video dataset with 33 classes and 156,416 videos, derived from an existing large-scale audio event dataset.

Abstract

We present an audio-visual multimodal approach for the task of zeroshot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to the visual modality and to images. We demonstrate that both audio and visual modalities are important for ZSL for videos. Since a dataset to study the task is currently not available, we also construct an appropriate multimodal dataset with 33 classes containing 156,416 videos, from an existing large-scale audio event dataset. We empirically show that the performance improves by adding the audio modality for both tasks of zeroshot classification and retrieval, when using multimodal extensions of embedding learning methods. We also propose a novel method to predict the 'dominant' modality using a jointly learned modality attention network. We learn the attention in a semi-supervised…
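
The abstract refers to embedding-based zeroshot classification and to a modality attention network that picks the 'dominant' modality. The sketch below illustrates the general idea only and is not the authors' implementation. It assumes that audio, video, and class-label embeddings already live in one shared space, that the attention network's output is available as two raw scores (passed in here as `attention_logits`, a hypothetical input), and that fusion is a simple weighted sum. All function names and the toy vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(audio_emb, video_emb, attention_logits):
    """Weighted sum of the two modality embeddings.

    attention_logits is assumed to be [audio_score, video_score];
    in a real system a learned network would produce these scores.
    """
    w_audio, w_video = softmax(attention_logits)
    return [w_audio * a + w_video * v for a, v in zip(audio_emb, video_emb)]


def classify(audio_emb, video_emb, attention_logits, class_embs):
    """Return the class whose embedding is closest to the fused video embedding.

    class_embs may contain classes never seen in training, which is
    what makes the nearest-embedding rule usable for zeroshot prediction.
    """
    fused = fuse(audio_emb, video_emb, attention_logits)
    return max(class_embs, key=lambda name: cosine(fused, class_embs[name]))


# Toy, hand-made embeddings (illustrative values only).
class_embs = {"dog": [1.0, 0.0, 0.0], "train": [0.0, 1.0, 0.0], "piano": [0.0, 0.0, 1.0]}
audio_emb = [0.1, 0.9, 0.0]   # audio points towards "train"
video_emb = [0.6, 0.3, 0.1]   # video points towards "dog"

audio_dominant = classify(audio_emb, video_emb, [2.0, 0.0], class_embs)
video_dominant = classify(audio_emb, video_emb, [0.0, 2.0], class_embs)
```

With these toy vectors, the predicted class follows whichever modality receives the larger attention score, which is the behaviour a dominant-modality predictor is meant to enable. Retrieval can reuse the same similarity: rank videos by `cosine` between their fused embedding and the query class embedding.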
