APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations

Elan Rosenfeld; Preetum Nakkiran; Hadi Pouransari; Oncel Tuzel; Fartash Faghri

arXiv:2210.03927·cs.LG·October 11, 2022

Open Access

TL;DR

This paper presents APE, a method that efficiently aligns pretrained unimodal encoders to learn multimodal representations with significantly less data and training time, outperforming state-of-the-art methods in certain tasks.

Contribution

It introduces a simple, effective approach to align existing encoders using small auxiliary functions and curated data, reducing training resources and improving robustness.

Findings

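01. Is competitive with, or outperforms, the state of the art in many multimodal settings.

02. Requires substantially less training data and time than training large networks on massive, noisy paired-modality datasets.

03. Surpasses the prior state of the art for ImageNet zero-shot classification on public data, given a properly chosen alignment distribution.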

Abstract

Recent advances in learning aligned multimodal representations have been primarily driven by training large neural networks on massive, noisy paired-modality datasets. In this work, we ask whether it is possible to achieve similar results with substantially less training time and data. We achieve this by taking advantage of existing pretrained unimodal encoders and careful curation of alignment data relevant to the downstream task of interest. We study a natural approach to aligning existing encoders via small auxiliary functions, and we find that this method is competitive with (or outperforms) state of the art in many settings while being less prone to overfitting, less costly to train, and more robust to distribution shift. With a properly chosen alignment distribution, our method surpasses prior state of the art for ImageNet zero-shot classification on public data while using two…
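The idea of aligning frozen encoders through a small auxiliary function can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the random arrays stand in for the outputs of frozen pretrained encoders, the auxiliary function is taken to be a single linear map (`W_img`, a hypothetical name), and the objective is assumed to be a CLIP-style symmetric contrastive loss, which the abstract does not specify. Only the forward pass is shown; training would update `W_img` alone while the encoders stay fixed.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def align(features, weight):
    """Hypothetical auxiliary function: one linear map into the shared space."""
    return l2_normalize(features @ weight)


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Assumed objective: symmetric cross-entropy over pairwise similarities.

    Row i of img_emb and row i of txt_emb are treated as a matching pair.
    """
    logits = img_emb @ txt_emb.T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def zero_shot_predict(img_emb, class_txt_emb):
    """Assign each image to the class whose text embedding is most similar."""
    return (img_emb @ class_txt_emb.T).argmax(axis=1)


rng = np.random.default_rng(0)
img_feats = rng.normal(size=(8, 32))      # stand-in for frozen image encoder outputs
txt_feats = rng.normal(size=(8, 16))      # stand-in for frozen text encoder outputs
W_img = rng.normal(size=(32, 16)) * 0.1   # the only trainable parameters in this sketch

img_emb = align(img_feats, W_img)
txt_emb = l2_normalize(txt_feats)
loss = contrastive_loss(img_emb, txt_emb)
preds = zero_shot_predict(img_emb, txt_emb)
```

In this sketch the trainable part is a single 32-by-16 matrix, which is the sense in which the auxiliary function is small relative to the encoders it sits on. Zero-shot classification then reduces to a nearest-neighbor lookup between aligned image embeddings and the text embeddings of the class names.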

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis