TL;DR
MANGO is an interpretable multimodal fusion method built on normalizing flows and a new invertible cross-attention layer, achieving state-of-the-art results on semantic segmentation, image translation, and genre classification.
Contribution
The paper proposes a new invertible cross-attention layer and a normalizing flow-based model for explicit, scalable multimodal fusion learning, improving interpretability and performance.
Findings
Achieved state-of-the-art results on semantic segmentation, image translation, and genre classification.
Developed three new cross-attention mechanisms, used inside the invertible cross-attention layer, to capture complex multimodal correlations.
Demonstrated scalability to high-dimensional multimodal data.
Abstract
Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms:…
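To make the idea of an invertible attention-based flow layer concrete, here is a minimal sketch in plain Python. It is not the paper's ICA layer, and it does not reproduce any of the three mechanisms the truncated abstract refers to. It swaps in a standard affine coupling layer whose scale and shift come from cross-attention over a second modality. All names (`CrossAttentionCoupling`, `image_tokens`, `text_tokens`, the weight shapes, the `tanh` bound on the log-scale) are assumptions made for illustration only.

```python
import math
import random


def matmul(x, w):
    """Multiply an (n, d) matrix by a (d, m) matrix, both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    z = sum(e)
    return [v / z for v in e]


def rand_matrix(rng, rows, cols):
    """Random (rows, cols) weight matrix; stands in for learned parameters."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class CrossAttentionCoupling:
    """Hypothetical affine coupling layer conditioned by cross-attention.

    Modality A's channels are split in two halves. The first half (unchanged)
    supplies the queries, modality B (unchanged) supplies keys and values, and
    the attention output gives a log-scale and shift for the second half.
    Because the scale and shift depend only on unchanged inputs, the layer can
    be inverted exactly and its log-determinant is the sum of the log-scales.
    """

    def __init__(self, dim_a, dim_b, seed=0):
        if dim_a % 2 != 0:
            raise ValueError("dim_a must be even so the channels can be split")
        rng = random.Random(seed)
        self.h = dim_a // 2
        self.wq = rand_matrix(rng, self.h, self.h)
        self.wk = rand_matrix(rng, dim_b, self.h)
        self.wv = rand_matrix(rng, dim_b, 2 * self.h)

    def _scale_shift(self, a1, b):
        q = matmul(a1, self.wq)                      # (Na, h)
        k = matmul(b, self.wk)                       # (Nb, h)
        v = matmul(b, self.wv)                       # (Nb, 2h)
        k_t = [list(col) for col in zip(*k)]         # (h, Nb)
        scores = matmul(q, k_t)                      # (Na, Nb)
        attn = [softmax([s / math.sqrt(self.h) for s in row]) for row in scores]
        ctx = matmul(attn, v)                        # (Na, 2h)
        # tanh keeps the log-scale in (-1, 1); an assumption for stability.
        log_s = [[math.tanh(c) for c in row[:self.h]] for row in ctx]
        shift = [row[self.h:] for row in ctx]
        return log_s, shift

    def forward(self, a, b):
        a1 = [row[:self.h] for row in a]
        a2 = [row[self.h:] for row in a]
        log_s, shift = self._scale_shift(a1, b)
        y2 = [[x * math.exp(s) + t for x, s, t in zip(xr, sr, tr)]
              for xr, sr, tr in zip(a2, log_s, shift)]
        log_det = sum(sum(row) for row in log_s)     # log |det Jacobian|
        return [r1 + r2 for r1, r2 in zip(a1, y2)], log_det

    def inverse(self, y, b):
        y1 = [row[:self.h] for row in y]
        y2 = [row[self.h:] for row in y]
        log_s, shift = self._scale_shift(y1, b)      # same values as in forward
        a2 = [[(v - t) * math.exp(-s) for v, s, t in zip(yr, sr, tr)]
              for yr, sr, tr in zip(y2, log_s, shift)]
        return [r1 + r2 for r1, r2 in zip(y1, a2)]


# Usage: 3 "image" tokens of width 4 attend over 5 "text" tokens of width 6.
rng = random.Random(1)
image_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
text_tokens = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(5)]

layer = CrossAttentionCoupling(dim_a=4, dim_b=6)
z, log_det = layer.forward(image_tokens, text_tokens)
recovered = layer.inverse(z, text_tokens)
max_err = max(abs(r - x)
              for rr, xr in zip(recovered, image_tokens)
              for r, x in zip(rr, xr))
```

In this sketch the round trip through `forward` and `inverse` recovers `image_tokens` up to floating-point error, and `log_det` is the quantity a flow would add to its log-likelihood. Conditioning only on unchanged inputs is what keeps both the inverse and the Jacobian determinant cheap, which is the sense in which such a fusion layer is tractable.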
