On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization

Jordi Armengol-Estapé; Vincent Michalski; Ramnath Kumar; Pierre-Luc St-Charles; Doina Precup; Samira Ebrahimi Kahou

arXiv:2405.18751 · cs.CV · May 31, 2024

TL;DR

This paper investigates whether language guidance improves multi-modal meta-learning for few-shot classification. It finds limited and inconsistent benefits and shows that the added computational cost must be accounted for when interpreting any gains.

Contribution

It introduces a multi-modal architecture with a bridge network for semantic alignment, and provides critical insights into its limitations and the impact of added complexity.

Findings

01. Improvements are inconsistent across benchmarks.

02. Additional compute and parameters drive observed gains.

03. Semantic alignment via the bridge network has limited impact.

Abstract

Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that cross-modal learning can improve representations for few-shot classification. More specifically, language is a rich modality that can be used to guide visual learning. In this work, we experiment with a multi-modal architecture for few-shot learning that consists of three components: a classifier, an auxiliary network, and a bridge network. While the classifier performs the main classification task, the auxiliary network learns to predict language representations from the same input, and the bridge network transforms high-level features of the auxiliary network into modulation parameters for layers of the few-shot classifier using conditional batch normalization. The bridge should encourage a form of lightweight semantic alignment between language and…
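
The modulation step described in the abstract can be illustrated with a minimal sketch of conditional batch normalization. This is not the authors' implementation. It assumes that each example has its own auxiliary feature vector, that the bridge is a single linear layer per parameter, and that the bridge predicts an offset to a unit scale plus a shift. All names here (`batch_stats`, `linear`, `conditional_batch_norm`, `make_bridge`) are hypothetical, and pure Python lists stand in for tensors.

```python
import math


def batch_stats(batch):
    """Per-channel mean and biased variance over a batch of feature vectors."""
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return means, variances


def linear(weights, bias, vec):
    """Dense layer; `weights` holds one row per output unit."""
    return [
        sum(w * v for w, v in zip(row, vec)) + b
        for row, b in zip(weights, bias)
    ]


def conditional_batch_norm(batch, aux_features, bridge, eps=1e-5):
    """Normalize classifier features, then scale and shift them per example.

    Assumption: the bridge maps each auxiliary feature vector to a scale
    offset (added to 1.0) and a shift, one value per channel.
    """
    means, variances = batch_stats(batch)
    outputs = []
    for x, h in zip(batch, aux_features):
        gamma = [1.0 + d for d in linear(bridge["w_gamma"], bridge["b_gamma"], h)]
        beta = linear(bridge["w_beta"], bridge["b_beta"], h)
        outputs.append([
            g * (xj - m) / math.sqrt(v + eps) + b
            for xj, m, v, g, b in zip(x, means, variances, gamma, beta)
        ])
    return outputs


def make_bridge(num_channels, aux_dim, beta_bias=0.0):
    """Hypothetical bridge with zero weights, so it starts as plain batch norm."""
    return {
        "w_gamma": [[0.0] * aux_dim for _ in range(num_channels)],
        "b_gamma": [0.0] * num_channels,
        "w_beta": [[0.0] * aux_dim for _ in range(num_channels)],
        "b_beta": [beta_bias] * num_channels,
    }


# Toy data: two examples, two classifier channels, three auxiliary features.
batch = [[1.0, 2.0], [3.0, 6.0]]
aux = [[0.5, -0.5, 1.0], [0.1, 0.2, 0.3]]

plain = conditional_batch_norm(batch, aux, make_bridge(2, 3))
shifted = conditional_batch_norm(batch, aux, make_bridge(2, 3, beta_bias=0.5))
```

With an all-zero bridge the layer reduces to ordinary batch normalization, which makes it easy to separate the effect of the modulation from the effect of the extra parameters, the comparison the findings above turn on.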

Taxonomy

Topics: Speech Recognition and Synthesis