Using Multiple Instance Learning to Build Multimodal Representations

Peiqi Wang; William M. Wells; Seth Berkowitz; Steven Horng; Polina Golland

arXiv:2212.05561 · cs.CV · June 14, 2023

TL;DR

This paper introduces a unified framework connecting multimodal representation learning with multiple instance learning, leading to a novel contrastive approach that achieves state-of-the-art results in medical image-text tasks.

Contribution

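The paper establishes a generic framework for constructing permutation-invariant score functions in multimodal representation learning, and uses it to derive a new contrastive learning method that achieves state-of-the-art results in several downstream tasks.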

Findings

01 Achieved state-of-the-art results in several downstream tasks

02 Unified framework includes many existing multimodal representation learning approaches as special cases

03 Proposed contrastive learning method outperforms baselines

Abstract

Image-text multimodal representation learning aligns data across modalities and enables important medical applications, e.g., image classification, visual grounding, and cross-modal retrieval. In this work, we establish a connection between multimodal representation learning and multiple instance learning. Based on this connection, we propose a generic framework for constructing permutation-invariant score functions with many existing multimodal representation learning approaches as special cases. Furthermore, we use the framework to derive a novel contrastive learning approach and demonstrate that our method achieves state-of-the-art results in several downstream tasks.
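To make "permutation-invariant score function" concrete, here is a minimal sketch, not the authors' method. It assumes an image is a bag of region embeddings and a report is a bag of sentence embeddings, scores every region-sentence pair by cosine similarity, and pools the pairwise scores with a symmetric aggregator (max, mean, or a log-mean-exp smooth maximum), as is common in multiple instance learning. The names `local_scores`, `aggregate`, and `info_nce`, the temperature values, and the one-directional InfoNCE loss are all illustrative assumptions; this page does not specify the paper's actual score function or loss.

```python
import numpy as np


def local_scores(regions, sentences):
    """Cosine similarity between every region and every sentence.

    regions: (R, d) array, sentences: (S, d) array -> (R, S) array.
    """
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    s = sentences / np.linalg.norm(sentences, axis=1, keepdims=True)
    return r @ s.T


def aggregate(scores, how="lse", temperature=0.1):
    """Pool pairwise scores into one image-text score.

    Each option is symmetric in its inputs, so the result does not
    depend on the order of regions or sentences (assumed pooling
    choices, not taken from the paper).
    """
    flat = scores.ravel()
    if how == "max":
        return float(flat.max())
    if how == "mean":
        return float(flat.mean())
    if how == "lse":
        # Smooth maximum: t * log(mean(exp(x / t))), computed stably.
        z = flat / temperature
        m = z.max()
        return float(temperature * (m + np.log(np.mean(np.exp(z - m)))))
    raise ValueError(f"unknown aggregator: {how}")


def info_nce(score_matrix, temperature=0.1):
    """Image-to-text InfoNCE loss for a batch.

    score_matrix[i, j] is the score of image i with text j; matched
    pairs are assumed to sit on the diagonal.
    """
    logits = score_matrix / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical batch: 4 images with 6 regions each, 4 reports with
    # 3 sentences each, all embedded in a shared 16-dimensional space.
    images = rng.normal(size=(4, 6, 16))
    reports = rng.normal(size=(4, 3, 16))
    batch_scores = np.array(
        [[aggregate(local_scores(img, rep)) for rep in reports]
         for img in images]
    )
    print("batch score matrix shape:", batch_scores.shape)
    print("InfoNCE loss:", info_nce(batch_scores))
```

Shuffling the rows of `regions` or `sentences` only reorders the entries of the pairwise score matrix, so every aggregator above returns the same value. That is the sense in which a score built this way is permutation-invariant.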

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Learning