Multimodal Prototypical Networks for Few-shot Learning
Frederik Pahde, Mihai Puscas, Tassilo Klein, Moin Nabi

TL;DR
This paper introduces a cross-modal feature generation framework that uses auxiliary text data to enrich the visual feature space for few-shot classification, outperforming existing few-shot methods on the CUB-200 and Oxford-102 datasets.
Contribution
The paper proposes a novel generative approach to map text into visual features, enhancing prototype quality in few-shot learning scenarios with multimodal data.
Findings
Outperforms state-of-the-art few-shot learning methods on the CUB-200 and Oxford-102 datasets.
Enables effective classification using only visual data at test time.
Demonstrates the benefit of auxiliary text data in low-data regimes.
Abstract
Although providing exceptional results for many computer vision tasks, state-of-the-art deep learning algorithms struggle catastrophically in low-data scenarios. However, if data in additional modalities exist (e.g. text), this can compensate for the lack of data and improve the classification results. To overcome this data scarcity, we design a cross-modal feature generation framework capable of enriching the sparsely populated embedding space in few-shot scenarios, leveraging data from the auxiliary modality. Specifically, we train a generative model that maps text data into the visual feature space to obtain more reliable prototypes. This makes it possible to exploit data from additional modalities (e.g. text) during training, while the ultimate task at test time remains classification with exclusively visual data. We show that in such cases nearest neighbor classification is a viable approach and…
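The idea in the abstract (generate visual features from text, fold them into class prototypes, then classify visual queries by nearest neighbor) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `text_to_visual` is a fixed linear map standing in for the trained generative model, `lam` is a hypothetical weight balancing generated against real features, and Euclidean distance is assumed for the nearest-prototype step.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def text_to_visual(text_vec, weights):
    """Hypothetical stand-in for the trained text-to-visual generator.

    Assumption: a fixed linear map (matrix-vector product). The real
    model described in the abstract is a learned generative network.
    """
    return [sum(w * t for w, t in zip(row, text_vec)) for row in weights]


def build_prototype(visual_feats, text_vecs, weights, lam=1.0):
    """Combine real and text-generated features into one class prototype.

    Assumption: a weighted average of the two means, controlled by `lam`.
    """
    p_visual = mean_vector(visual_feats)
    generated = [text_to_visual(t, weights) for t in text_vecs]
    p_text = mean_vector(generated)
    return [(v + lam * g) / (1.0 + lam) for v, g in zip(p_visual, p_text)]


def classify(query, prototypes):
    """Return the label whose prototype is nearest to a visual query.

    Only visual features are needed here; text is used solely when the
    prototypes are built.
    """
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))
```

In a one-shot setting, `build_prototype` would receive a single visual feature vector and one or more text embeddings per class, so the generated features are what keeps the prototype from resting on a single sample. At test time `classify` is called with a visual feature vector only.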
