Embed Everything: A Method for Efficiently Co-Embedding Multi-Modal Spaces
Sarah Di, Robin Yu, Amol Kapoor

TL;DR
This paper introduces a cost-effective method for multi-modal co-embedding that reuses pretrained models without passing gradients through them, enabling efficient integration of diverse data types such as images and audio.
Contribution
It presents a novel heterogeneous transfer learning approach that simplifies multi-modal embedding by preprocessing with pretrained models, reducing training costs and complexity.
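The idea of preprocessing with pretrained models can be illustrated with a minimal sketch. This is not the authors' implementation: the two "pretrained" encoders are hypothetical stand-ins (fixed random projections), the paired data is synthetic, and the mean-squared alignment objective is a simple substitute chosen for brevity rather than the loss used in the paper. What it shows is the cost structure: each frozen encoder runs once, its outputs are cached, and only a small head is trained on the cached embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for pretrained encoders: fixed weights, never updated.
W_IMG = rng.normal(size=(64, 16))   # raw image features -> image embedding
W_AUD = rng.normal(size=(48, 12))   # raw audio features -> audio embedding

def frozen_image_encoder(x):
    return np.tanh(x @ W_IMG)

def frozen_audio_encoder(x):
    return np.tanh(x @ W_AUD)

# Synthetic paired data that shares one latent cause (illustrative only).
n = 256
z = rng.normal(size=(n, 8))
images = 0.1 * z @ rng.normal(size=(8, 64))
audio = 0.1 * z @ rng.normal(size=(8, 48))

# Step 1: run each frozen encoder once and cache the outputs.
img_emb = frozen_image_encoder(images)   # shape (256, 16)
aud_emb = frozen_audio_encoder(audio)    # shape (256, 12)

# Step 2: train only a small linear head on the cached embeddings.
def alignment_loss(head):
    residual = aud_emb @ head - img_emb
    return float(np.mean(np.sum(residual ** 2, axis=1)))

head = np.zeros((12, 16))
initial_loss = alignment_loss(head)
for _ in range(500):
    residual = aud_emb @ head - img_emb
    grad = 2.0 / n * aud_emb.T @ residual   # gradient w.r.t. the head only
    head -= 0.05 * grad
final_loss = alignment_loss(head)
```

Because the training loop only touches `img_emb`, `aud_emb`, and `head`, no gradient is ever computed for `W_IMG` or `W_AUD`, and the expensive encoders do not need to be in memory during training.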
Findings
Demonstrates the method on a joint image-audio embedding task
Reduces training costs by avoiding gradient backpropagation through pretrained models
Provides a framework for universal multi-modal embeddings
Abstract
Any general artificial intelligence system must be able to interpret, operate on, and produce data in a multi-modal latent space that can represent audio, imagery, text, and more. In the last decade, deep neural networks have seen remarkable success in unimodal data distributions, while transfer learning techniques have seen a massive expansion of model reuse across related domains. However, training multi-modal networks from scratch remains expensive and elusive, while heterogeneous transfer learning (HTL) techniques remain relatively underdeveloped. In this paper, we propose a novel and cost-effective HTL strategy for co-embedding multi-modal spaces. Our method avoids cost inefficiencies by preprocessing embeddings using pretrained models for all components, without passing gradients through these models. We demonstrate the use of this system in a joint image-audio embedding task. Our…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Animal Vocal Communication and Behavior
