AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings

Pratik Mazumder; Pravendra Singh; Kranti Kumar Parida; Vinay P. Namboodiri

arXiv:2005.13402·cs.CV·November 24, 2020

TL;DR

AVGZSLNet introduces a multi-modal zero-shot learning method that reconstructs label features from audio and video embeddings, effectively handling unseen classes and missing modalities during testing.

Contribution

The paper presents a multi-modal generalized zero-shot learning framework that uses a cross-modal decoder and a composite triplet loss to align audio and video embeddings with class label text embeddings and to handle a missing modality at test time.

Findings

1. Outperforms existing models in zero-shot classification and retrieval tasks.

2. Effective even when one modality is missing at test time.

3. Validated through extensive ablation studies.
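To make the missing-modality finding concrete, here is a minimal sketch of how a prediction could be made once audio and video embeddings live in the same space as the class label text embeddings. It assumes, without confirmation from this summary, that classification picks the nearest class label embedding and that available modalities are simply averaged. The function name `predict_class` and the toy vectors are hypothetical.

```python
import math

def predict_class(class_text_embeddings, audio=None, video=None):
    """Return the class whose label embedding is nearest to the input.

    Assumption (not from the paper summary): when both modalities are
    present, their embeddings are averaged; when one is missing, the
    remaining one is used on its own.
    """
    available = [e for e in (audio, video) if e is not None]
    if not available:
        raise ValueError("at least one of audio or video is required")
    # Element-wise mean over whichever modality embeddings are present.
    query = [sum(vals) / len(available) for vals in zip(*available)]
    # Nearest class label embedding by Euclidean distance.
    return min(
        class_text_embeddings,
        key=lambda name: math.dist(query, class_text_embeddings[name]),
    )

# Toy 2-D label embeddings; in a generalized zero-shot setting this
# dictionary would hold both seen and unseen class labels.
toy_classes = {"dog": [1.0, 0.0], "train": [0.0, 1.0]}
audio_only = predict_class(toy_classes, audio=[0.9, 0.2])
video_only = predict_class(toy_classes, video=[0.1, 0.8])
both = predict_class(toy_classes, audio=[0.9, 0.2], video=[0.7, 0.1])
```

Because every modality is compared against the same label embeddings, dropping one modality changes only the query vector, not the classifier.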

Abstract

In this paper, we propose a novel approach for generalized zero-shot learning in a multi-modal setting, where we have novel classes of audio/video during testing that are not seen during training. We use the semantic relatedness of text embeddings as a means for zero-shot learning by aligning audio and video embeddings with the corresponding class label text feature space. Our approach uses a cross-modal decoder and a composite triplet loss. The cross-modal decoder enforces a constraint that the class label text features can be reconstructed from the audio and video embeddings of data points. This helps the audio and video embeddings to move closer to the class label text embedding. The composite triplet loss makes use of the audio, video, and text embeddings. It helps bring the embeddings from the same class closer and push away the embeddings from different classes in a multi-modal…
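The two training signals described in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the exact terms of the composite triplet loss are not given in this summary, so the version below simply sums one triplet term per modality, and the decoder is a hypothetical single linear map. All function names are made up for illustration.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss on Euclidean distances."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)

def composite_triplet_loss(audio, video, text_pos, text_neg, margin=1.0):
    """Assumed composition: pull each modality toward its own class label
    embedding and push it away from another class's label embedding."""
    return (triplet_loss(audio, text_pos, text_neg, margin)
            + triplet_loss(video, text_pos, text_neg, margin))

def linear_decoder(embedding, weights):
    """Hypothetical decoder: one matrix-vector product."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]

def reconstruction_loss(decoded, text_feature):
    """Mean squared error between decoded output and the label text feature."""
    return sum((d - t) ** 2 for d, t in zip(decoded, text_feature)) / len(text_feature)

# Toy 2-D example: both modalities already sit near their class label.
audio_emb, video_emb = [1.0, 0.0], [0.9, 0.1]
label_same, label_other = [1.0, 0.0], [0.0, 1.0]
aligned = composite_triplet_loss(audio_emb, video_emb, label_same, label_other)
misaligned = composite_triplet_loss(audio_emb, video_emb, label_other, label_same)

identity = [[1.0, 0.0], [0.0, 1.0]]
recon = reconstruction_loss(linear_decoder(audio_emb, identity), label_same)
```

In the toy example the aligned case incurs zero triplet loss while the swapped labels are penalized, and the reconstruction term is zero only when the decoded embedding matches the label text feature.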

Taxonomy

Methods: Triplet Loss