APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations
Elan Rosenfeld, Preetum Nakkiran, Hadi Pouransari, Oncel Tuzel, Fartash Faghri

TL;DR
This paper presents APE, a method that aligns pretrained unimodal encoders to learn multimodal representations with substantially less data and training time, matching or outperforming state-of-the-art methods in many settings.
Contribution
It introduces a simple, effective approach to align existing encoders using small auxiliary functions and curated data, reducing training resources and improving robustness.
Findings
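Is competitive with, or outperforms, the state of the art in many settings, while being less prone to overfitting and more robust to distribution shift.
Requires substantially less training data and time than training large networks on massive, noisy paired-modality datasets.
With a properly chosen alignment distribution, surpasses the prior state of the art for ImageNet zero-shot classification on public data.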
Abstract
Recent advances in learning aligned multimodal representations have been primarily driven by training large neural networks on massive, noisy paired-modality datasets. In this work, we ask whether it is possible to achieve similar results with substantially less training time and data. We achieve this by taking advantage of existing pretrained unimodal encoders and careful curation of alignment data relevant to the downstream task of interest. We study a natural approach to aligning existing encoders via small auxiliary functions, and we find that this method is competitive with (or outperforms) state of the art in many settings while being less prone to overfitting, less costly to train, and more robust to distribution shift. With a properly chosen alignment distribution, our method surpasses prior state of the art for ImageNet zero-shot classification on public data while using two…
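The abstract describes aligning existing encoders through small auxiliary functions but does not spell out the form of those functions or the training objective. The following is a minimal sketch of one way such an alignment could look, not the authors' implementation. It assumes that features from both encoders are precomputed and frozen, that the auxiliary function is a single affine layer applied to the image side only, and that the objective is a symmetric CLIP-style contrastive loss. All names (`aligned_embeddings`, `contrastive_loss`, `W`, `b`, the `temperature` value) are hypothetical, and random arrays stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def aligned_embeddings(frozen_feats, W, b):
    """Assumed auxiliary function: one affine layer on top of frozen features."""
    return l2_normalize(frozen_feats @ W + b)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair."""
    logits = img_emb @ txt_emb.T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Placeholder data: random arrays stand in for frozen encoder outputs.
rng = np.random.default_rng(0)
batch, d_img, d_txt = 8, 32, 16
img_feats = rng.normal(size=(batch, d_img))                # frozen image features
txt_feats = l2_normalize(rng.normal(size=(batch, d_txt)))  # frozen text features

# Only W and b would be trained; both encoders stay fixed.
W = rng.normal(scale=0.1, size=(d_img, d_txt))
b = np.zeros(d_txt)

img_emb = aligned_embeddings(img_feats, W, b)
loss = contrastive_loss(img_emb, txt_feats)
print(f"contrastive loss at initialization: {loss:.3f}")
```

In this sketch the trainable parameter count is only `d_img * d_txt + d_txt`, which illustrates why fitting an auxiliary function on frozen features can be far cheaper than training both encoders end to end.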
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
