Embed Everything: A Method for Efficiently Co-Embedding Multi-Modal Spaces

Sarah Di; Robin Yu; Amol Kapoor

arXiv:2110.04599·cs.LG·October 12, 2021

TL;DR

This paper introduces a cost-effective method for multi-modal co-embedding that reuses pretrained models without passing gradients through them, enabling efficient integration of diverse data types such as images and audio.

Contribution

It presents a novel heterogeneous transfer learning approach that simplifies multi-modal embedding by preprocessing with pretrained models, reducing training costs and complexity.

Findings

01. Demonstrates the approach on a joint image-audio embedding task

02. Reduces training costs by avoiding gradient backpropagation through pretrained models

03. Provides a framework for universal multi-modal embeddings

Abstract

Any general artificial intelligence system must be able to interpret, operate on, and produce data in a multi-modal latent space that can represent audio, imagery, text, and more. In the last decade, deep neural networks have seen remarkable success in unimodal data distributions, while transfer learning techniques have seen a massive expansion of model reuse across related domains. However, training multi-modal networks from scratch remains expensive and elusive, while heterogeneous transfer learning (HTL) techniques remain relatively underdeveloped. In this paper, we propose a novel and cost-effective HTL strategy for co-embedding multi-modal spaces. Our method avoids cost inefficiencies by preprocessing embeddings using pretrained models for all components, without passing gradients through these models. We demonstrate the use of this system in a joint image-audio embedding task. Our…
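
The preprocessing idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the architecture or the loss, so everything named here is an assumption made for illustration. The "pretrained" encoders are hypothetical stand-ins (fixed random linear maps), the paired data is synthetic, and the trainable part is a single linear head fitted with a mean-squared-error objective. The one property the sketch is meant to show is that each frozen encoder runs once per sample, and training afterwards touches only the small head.

```python
import random

random.seed(0)

def make_frozen_encoder(in_dim, out_dim):
    """Hypothetical stand-in for a pretrained model: a fixed random linear map.
    Its weights are never updated, so no gradients ever pass through it."""
    weights = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return encode

image_encoder = make_frozen_encoder(in_dim=6, out_dim=4)  # assumed sizes
audio_encoder = make_frozen_encoder(in_dim=5, out_dim=3)  # assumed sizes

def make_pair():
    """Synthetic paired sample: both modalities derive from one shared latent z."""
    z = [random.gauss(0, 1) for _ in range(3)]
    image = z + z       # 6-dimensional toy "image"
    audio = z + z[:2]   # 5-dimensional toy "audio"
    return image, audio

pairs = [make_pair() for _ in range(32)]

# Step 1: preprocess. Each frozen encoder is run exactly once per sample and
# its output is cached, so the expensive models are not needed during training.
image_emb = [image_encoder(img) for img, _ in pairs]
audio_emb = [audio_encoder(aud) for _, aud in pairs]

# Step 2: train only a small head that maps cached audio embeddings (3-d)
# into the image embedding space (4-d).
head = [[0.0] * 3 for _ in range(4)]

def project(head, a):
    return [sum(w * v for w, v in zip(row, a)) for row in head]

def mse(head):
    total = 0.0
    for a, target in zip(audio_emb, image_emb):
        pred = project(head, a)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(audio_emb)

initial_loss = mse(head)
learning_rate = 0.005
n = len(audio_emb)

for _ in range(1000):
    # Gradient of the mean squared error with respect to the head weights only.
    grad = [[0.0] * 3 for _ in range(4)]
    for a, target in zip(audio_emb, image_emb):
        pred = project(head, a)
        for i in range(4):
            err = pred[i] - target[i]
            for j in range(3):
                grad[i][j] += 2.0 * err * a[j] / n
    for i in range(4):
        for j in range(3):
            head[i][j] -= learning_rate * grad[i][j]

final_loss = mse(head)
print("loss before training:", initial_loss)
print("loss after training:", final_loss)
```

Because the cached embeddings are plain lists of numbers, the training loop has no dependency on the encoders at all. In a realistic setting the stand-in encoders would be replaced by actual pretrained image and audio models, and the head and loss would be chosen for the task; those choices are outside what the abstract specifies.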

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Animal Vocal Communication and Behavior