Deep Multi-Modal Sets
Austin Reiter, Menglin Jia, Pu Yang, Ser-Nam Lim

TL;DR
This paper introduces Deep Multi-Modal Sets, a novel approach that models multiple data modalities as an unordered set, improving scalability, interpretability, and performance in multi-modal learning tasks.
Contribution
It proposes a set-based representation for multi-modal data that is permutation-invariant and scalable, addressing limitations of traditional concatenation methods.
Findings
Achieved state-of-the-art results on the Ads-Parallelity dataset.
Achieved state-of-the-art results on the MM-IMDb dataset.
Demonstrated interpretability of feature contributions during inference.
Abstract
Many vision-related tasks benefit from reasoning over multiple modalities to leverage complementary views of data in an attempt to learn robust embedding spaces. Most deep learning-based methods rely on a late fusion technique whereby multiple feature types are encoded and concatenated and then a multi-layer perceptron (MLP) combines the fused embedding to make predictions. This has several limitations, such as an unnatural enforcement that all features be present at all times as well as constraining only a constant number of occurrences of a feature modality at any given time. Furthermore, as more modalities are added, the concatenated embedding grows. To mitigate this, we propose Deep Multi-Modal Sets: a technique that represents a collection of features as an unordered set rather than one long ever-growing fixed-size vector. The set is constructed so that we have invariance both to…
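The contrast between concatenation and a set representation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy embeddings, the shared embedding size of 4, and the choice of max and sum as pooling operators are all assumptions made for illustration. It only shows the two properties the abstract names, that pooling over a set ignores element order and keeps a fixed output size however many features are present.

```python
from itertools import permutations


def concat_fusion(features):
    """Late fusion by concatenation: output length grows with every added feature."""
    return [x for f in features for x in f]


def set_pool(features, mode="max"):
    """Permutation-invariant pooling over same-length feature vectors (assumed operators)."""
    if not features:
        raise ValueError("set_pool needs at least one feature vector")
    dims = list(zip(*features))  # group values dimension by dimension
    if mode == "max":
        return [max(d) for d in dims]
    if mode == "sum":
        return [sum(d) for d in dims]
    raise ValueError(f"unknown mode: {mode}")


# Three hypothetical modality embeddings, assumed already projected to a shared size of 4.
image = [0.2, 0.9, 0.0, 0.4]
text = [0.7, 0.1, 0.3, 0.4]
audio = [0.5, 0.5, 0.8, 0.1]

pooled = set_pool([image, text, audio])

# Every ordering of the same set pools to the same vector.
assert all(set_pool(list(p)) == pooled for p in permutations([image, text, audio]))

# Dropping a modality changes the concatenated length but not the pooled length.
assert len(concat_fusion([image, text])) != len(concat_fusion([image, text, audio]))
assert len(set_pool([image, text])) == len(pooled)
```

A downstream classifier fed by `concat_fusion` needs a fixed number of inputs in a fixed order, whereas one fed by `set_pool` sees the same input size whether one, two, or three features are supplied.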
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
