On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization
Jordi Armengol-Estapé, Vincent Michalski, Ramnath Kumar, Pierre-Luc St-Charles, Doina Precup, Samira Ebrahimi Kahou

TL;DR
This paper investigates whether language guidance improves multi-modal meta-learning for few-shot classification. It finds limited and inconsistent benefits, and the gains that do appear are attributable to added compute and parameters rather than to semantic alignment.
Contribution
It introduces a multi-modal architecture with a bridge network for semantic alignment, and provides critical insights into its limitations and the impact of added complexity.
Findings
Improvements are inconsistent across benchmarks.
Additional compute and parameters drive observed gains.
Semantic alignment via the bridge network has limited impact.
Abstract
Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that cross-modal learning can improve representations for few-shot classification. More specifically, language is a rich modality that can be used to guide visual learning. In this work, we experiment with a multi-modal architecture for few-shot learning that consists of three components: a classifier, an auxiliary network, and a bridge network. While the classifier performs the main classification task, the auxiliary network learns to predict language representations from the same input, and the bridge network transforms high-level features of the auxiliary network into modulation parameters for layers of the few-shot classifier using conditional batch normalization. The bridge should encourage a form of lightweight semantic alignment between language and…
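The modulation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the bridge network is a single linear map and that it predicts per-example offsets to the batch-norm scale and shift, a common formulation of conditional batch normalization. The names `conditional_batch_norm`, `bridge`, `w_gamma`, and `w_beta` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def batch_stats(x):
    """Per-channel mean and variance over the batch; x is [batch][channels]."""
    n = len(x)
    channels = range(len(x[0]))
    means = [sum(row[c] for row in x) / n for c in channels]
    variances = [sum((row[c] - means[c]) ** 2 for row in x) / n for c in channels]
    return means, variances


def linear(v, weights, bias):
    """Affine map with weights shaped [out][in]."""
    return [sum(w * vi for w, vi in zip(row, v)) + b for row, b in zip(weights, bias)]


def conditional_batch_norm(x, aux_features, bridge, gamma, beta, eps=1e-5):
    """Normalize classifier features x, then scale and shift each example
    with parameters offset by the bridge's output.

    ASSUMPTION: the bridge is one linear layer per parameter and predicts
    offsets (delta gamma, delta beta) from the auxiliary network's features.
    """
    means, variances = batch_stats(x)
    out = []
    for row, aux in zip(x, aux_features):
        d_gamma = linear(aux, bridge["w_gamma"], bridge["b_gamma"])
        d_beta = linear(aux, bridge["w_beta"], bridge["b_beta"])
        out.append([
            (gamma[c] + d_gamma[c]) * (row[c] - means[c]) / math.sqrt(variances[c] + eps)
            + (beta[c] + d_beta[c])
            for c in range(len(row))
        ])
    return out
```

With an all-zero bridge the function reduces to ordinary batch normalization, which gives a convenient baseline for checking whether the language-derived modulation contributes anything beyond the extra parameters.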
Taxonomy
Topics: Speech Recognition and Synthesis
