CLARGA: Multimodal Graph Representation Learning over Arbitrary Sets of Modalities
Santosh Patapati

TL;DR
CLARGA is a versatile multimodal fusion architecture that constructs sample-specific graphs for efficient, adaptive, and robust multimodal representation learning across diverse tasks and datasets.
Contribution
The paper introduces a general-purpose, graph-based multimodal fusion framework that adapts to any number and type of modalities without changes to the underlying framework.
Findings
CLARGA outperforms baselines and state-of-the-art models on 7 diverse datasets.
It is robust to missing modality inputs.
It scales efficiently with the number of modalities because its fusion cost is sub-quadratic in that number.
Abstract
We introduce CLARGA, a general-purpose multimodal fusion architecture for multimodal representation learning that works with any number and type of modalities without changing the underlying framework. Given a supervised dataset, CLARGA can be applied to virtually any machine learning task to fuse different multimodal representations for processing by downstream layers. On a sample-by-sample basis, CLARGA learns how modalities should inform one another by building an attention-weighted graph over their features and passing messages along this graph with a multi-head Graph Attention Network. Not only does this make CLARGA highly adaptive, as it constructs unique graphs for different samples, it also makes for efficient fusion with sub-quadratic complexity as the number of modalities grows. Through a learnable mask, it can also adapt to missing modality inputs. The model is trained with a…
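To make the fusion step described in the abstract more concrete, here is a minimal NumPy sketch of attention-weighted message passing over one sample's modality features. It is an illustration under stated assumptions, not the paper's implementation: the function name `fuse_modalities`, all parameter names and shapes, the GAT-style additive scoring, the use of a mask token to stand in for a missing modality, and the mean pooling at the end are hypothetical choices made here. The sketch also scores every pair of modalities, so it is quadratic in the number of modalities and does not reproduce the sub-quadratic scaling the abstract reports.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def fuse_modalities(feats, present, W, a_src, a_dst, mask_token):
    """Fuse one sample's modality features with GAT-style attention.

    All names and shapes are assumptions made for this sketch.
    feats:      (M, D) one feature vector per modality
    present:    (M,)   bool, False where a modality is missing
    W:          (H, D, F) per-head linear projection
    a_src/dst:  (H, F) per-head attention vectors
    mask_token: (D,)   stand-in for missing inputs (learnable in a real model)
    Returns the fused vector (H * F,) and the attention weights (H, M, M).
    """
    # Assumption: a missing modality is replaced by the mask token.
    x = np.where(present[:, None], feats, mask_token[None, :])   # (M, D)
    h = np.einsum("md,hdf->hmf", x, W)                           # (H, M, F)

    # Additive attention score for node i attending to node j.
    src = np.einsum("hmf,hf->hm", h, a_src)                      # (H, M)
    dst = np.einsum("hmf,hf->hm", h, a_dst)                      # (H, M)
    e = src[:, :, None] + dst[:, None, :]                        # (H, M, M)
    e = np.where(e > 0, e, 0.2 * e)                              # LeakyReLU

    # Each row of alpha is this sample's edge weights into one node.
    alpha = softmax(e, axis=-1)                                  # (H, M, M)
    out = np.einsum("hij,hjf->hif", alpha, h)                    # (H, M, F)

    # Assumption: mean-pool nodes, then concatenate heads.
    fused = out.mean(axis=1).reshape(-1)                         # (H * F,)
    return fused, alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, D, H, F = 3, 4, 2, 5
    fused, alpha = fuse_modalities(
        feats=rng.normal(size=(M, D)),
        present=np.array([True, False, True]),  # second modality missing
        W=rng.normal(size=(H, D, F)),
        a_src=rng.normal(size=(H, F)),
        a_dst=rng.normal(size=(H, F)),
        mask_token=rng.normal(size=D),
    )
    print(fused.shape, alpha.shape)
```

Because the attention weights are computed from the features themselves, two samples with different inputs get different edge weights, which is the sense in which the graph is sample-specific.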
Taxonomy
Topics: Advanced Graph Neural Networks · Multimodal Machine Learning Applications · Emotion and Mood Recognition
