Using Multiple Instance Learning to Build Multimodal Representations
Peiqi Wang, William M. Wells, Seth Berkowitz, Steven Horng, Polina Golland

TL;DR
This paper introduces a unified framework connecting multimodal representation learning with multiple instance learning, leading to a novel contrastive approach that achieves state-of-the-art results in medical image-text tasks.
Contribution
The paper establishes a generic framework for constructing permutation-invariant score functions in multimodal learning, and derives from it a new contrastive method that achieves state-of-the-art results in several downstream tasks.
Findings
Achieved state-of-the-art results in downstream tasks
Unified framework encompasses existing approaches
Proposed contrastive learning method outperforms baselines
Abstract
Image-text multimodal representation learning aligns data across modalities and enables important medical applications, e.g., image classification, visual grounding, and cross-modal retrieval. In this work, we establish a connection between multimodal representation learning and multiple instance learning. Based on this connection, we propose a generic framework for constructing permutation-invariant score functions with many existing multimodal representation learning approaches as special cases. Furthermore, we use the framework to derive a novel contrastive learning approach and demonstrate that our method achieves state-of-the-art results in several downstream tasks.
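To make the idea of a permutation-invariant score function concrete, here is a minimal NumPy sketch. It treats an image as a bag of local region embeddings (the multiple-instance view), scores each region against a text embedding, and pools the per-region scores with an order-independent operator. The function names, the choice of cosine similarity, the three pooling operators, and the temperature value are all illustrative assumptions; the abstract does not specify the paper's actual score function or loss, so this is not the authors' method.

```python
import numpy as np

def instance_scores(regions, sentence):
    """Cosine similarity between each region embedding (row) and one text embedding."""
    regions = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    sentence = sentence / np.linalg.norm(sentence)
    return regions @ sentence

def bag_score(regions, sentence, pool="lse", temperature=0.1):
    """Pool per-region scores into one image-text score.

    Every pooling option here is symmetric in its inputs, so the result
    does not depend on the order of the regions (permutation invariance).
    The options are assumptions chosen for illustration.
    """
    s = instance_scores(regions, sentence)
    if pool == "max":
        return float(s.max())
    if pool == "mean":
        return float(s.mean())
    if pool == "lse":
        # Smooth maximum: temperature * log(mean(exp(s / temperature))),
        # computed with the max subtracted for numerical stability.
        z = s / temperature
        m = z.max()
        return float(temperature * (m + np.log(np.mean(np.exp(z - m)))))
    raise ValueError(f"unknown pooling: {pool}")

def contrastive_loss(image_bags, sentences, pool="lse", temperature=0.1):
    """InfoNCE-style loss: image i should score highest with sentence i."""
    n = len(image_bags)
    scores = np.array([[bag_score(image_bags[i], sentences[j], pool)
                        for j in range(n)] for i in range(n)])
    logits = scores / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Swapping the `pool` argument changes how strongly the single best-matching region dominates the image-text score, which is one way a single template can cover several scoring schemes.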
Taxonomy
Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Learning
