Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder with Semantic Concepts
Nihar Bendre, Kevin Desai, Peyman Najafirad

TL;DR
This paper introduces a Multimodal Variational Auto-Encoder that learns a shared latent space from image features and semantic data, improving generalized zero-shot learning by leveraging local and global semantic knowledge.
Contribution
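The paper proposes M-VAE, a Multimodal Variational Auto-Encoder that concatenates image features and semantic data into a single embedding, learns a shared latent space from it, and applies a multi-modal loss when reconstructing that embedding, with the aim of improving generalized zero-shot learning performance.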
Findings
Outperforms state-of-the-art methods on four benchmark datasets.
Effectively correlates modalities to improve novel sample prediction.
Utilizes local and global semantic knowledge for better generalization.
Abstract
With the ever-increasing amount of data, the central challenge in multimodal learning involves limitations of labelled samples. For the task of classification, techniques such as meta-learning, zero-shot learning, and few-shot learning showcase the ability to learn information about novel classes based on prior knowledge. Recent techniques try to learn a cross-modal mapping between the semantic space and the image space. However, they tend to ignore the local and global semantic knowledge. To overcome this problem, we propose a Multimodal Variational Auto-Encoder (M-VAE) which can learn the shared latent space of image features and the semantic space. In our approach we concatenate multimodal data to a single embedding before passing it to the VAE for learning the latent space. We propose the use of a multi-modal loss during the reconstruction of the feature embedding through the…
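The pipeline described in the abstract (concatenate the modalities, encode to a shared latent space, reconstruct, apply a multi-modal loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The single linear layers, the dimensions (2048-d image features, 85-d semantic vectors, 64-d latent), the weights `w_img`, `w_sem`, and `beta`, and the specific loss (a weighted per-modality squared error plus the standard VAE KL term) are all assumptions made for illustration; the truncated abstract does not give the exact form of the paper's multi-modal loss.

```python
import numpy as np

def init_linear(rng, in_dim, out_dim):
    # Hypothetical single linear layer with small random weights (illustration only).
    return rng.normal(0.0, 0.01, size=(in_dim, out_dim)), np.zeros(out_dim)

def mvae_forward(rng, img_feat, sem_feat, params):
    # Concatenate both modalities into one embedding before encoding.
    x = np.concatenate([img_feat, sem_feat], axis=1)
    w_mu, b_mu = params["enc_mu"]
    w_lv, b_lv = params["enc_logvar"]
    mu = x @ w_mu + b_mu
    logvar = x @ w_lv + b_lv
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    w_dec, b_dec = params["dec"]
    x_hat = z @ w_dec + b_dec  # reconstruct the concatenated embedding
    return x, x_hat, mu, logvar

def multimodal_loss(x, x_hat, mu, logvar, img_dim, w_img=1.0, w_sem=1.0, beta=1.0):
    # Assumed loss: a separate reconstruction error per modality, plus a KL term.
    img_err = np.mean(np.sum((x[:, :img_dim] - x_hat[:, :img_dim]) ** 2, axis=1))
    sem_err = np.mean(np.sum((x[:, img_dim:] - x_hat[:, img_dim:]) ** 2, axis=1))
    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return w_img * img_err + w_sem * sem_err + beta * kl

# Usage with random stand-in data (all sizes are hypothetical).
rng = np.random.default_rng(0)
img_dim, sem_dim, latent_dim, batch = 2048, 85, 64, 4
params = {
    "enc_mu": init_linear(rng, img_dim + sem_dim, latent_dim),
    "enc_logvar": init_linear(rng, img_dim + sem_dim, latent_dim),
    "dec": init_linear(rng, latent_dim, img_dim + sem_dim),
}
img_feat = rng.standard_normal((batch, img_dim))
sem_feat = rng.standard_normal((batch, sem_dim))
x, x_hat, mu, logvar = mvae_forward(rng, img_feat, sem_feat, params)
loss = multimodal_loss(x, x_hat, mu, logvar, img_dim)
```

Splitting the reconstruction error by modality lets the much smaller semantic vector be weighted independently of the high-dimensional image features, which is one plausible reading of a "multi-modal loss". A real model would use deeper encoder and decoder networks and gradient-based training.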
