DM$^2$S$^2$: Deep Multi-Modal Sequence Sets with Hierarchical Modality Attention
Shunsuke Kitada, Yuki Iwazaki, Riku Togashi, Hitoshi Iyatomi

TL;DR
This paper introduces DM$^2$S$^2$, a novel deep learning framework that models multimodal data as sequence sets with hierarchical attention, improving interpretability and performance over traditional mid-fusion methods.
Contribution
The paper proposes a set-aware multimodal learning approach with hierarchical attention mechanisms, addressing issues of high dimensionality and missing modalities in mid-fusion models.
Findings
Performance is comparable to or better than that of previous models.
Visualizing the attention weights makes the model's predictions interpretable.
The set-based approach handles multiple modalities effectively.
Abstract
There is increasing interest in the use of multimodal data in various web applications, such as digital advertising and e-commerce. Typical methods for extracting important information from multimodal data rely on a mid-fusion architecture that combines the feature representations from multiple encoders. However, as the number of modalities increases, several potential problems with the mid-fusion model structure arise, such as an increase in the dimensionality of the concatenated multimodal features and missing modalities. To address these problems, we propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets (DM$^2$S$^2$). Our set-aware concept consists of three components that capture the relationships among multiple modalities: (a) a BERT-based encoder to handle the inter- and intra-order of elements in the sequences, (b)…
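The two problems the abstract attributes to mid-fusion can be illustrated with a small sketch. This is not the paper's architecture: `concat_fusion`, `attention_pool`, and the fixed `query` vector are hypothetical names chosen for illustration, and a real model would learn the scoring parameters. The sketch only contrasts concatenation, whose output grows with the number of modalities, against attention pooling over a set, where a missing modality is simply absent.

```python
import math


def concat_fusion(features):
    """Mid-fusion baseline: concatenate one feature vector per modality.

    The output length is the sum of the input lengths, so it grows
    with every modality that is added.
    """
    fused = []
    for vec in features:
        fused.extend(vec)
    return fused


def attention_pool(features, query):
    """Softmax-weighted sum over a set of modality vectors.

    features: dict mapping a modality name to a vector of len(query);
              a missing modality is simply left out of the dict.
    query:    scoring vector (an assumption; it would be learned).
    Returns the pooled vector and the per-modality attention weights.
    """
    if not features:
        raise ValueError("at least one modality is required")
    names = sorted(features)
    scores = [sum(f * q for f, q in zip(features[n], query)) for n in names]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = {n: e / total for n, e in zip(names, exps)}
    pooled = [
        sum(weights[n] * features[n][i] for n in names)
        for i in range(len(query))
    ]
    return pooled, weights


# Hypothetical per-modality features, all of dimension 2.
sample = {"title": [1.0, 0.0], "image": [0.0, 1.0], "price": [0.5, 0.5]}
query = [1.0, 1.0]

fused = concat_fusion(list(sample.values()))      # length grows to 6
pooled_all, weights_all = attention_pool(sample, query)

# Drop one modality: the pooled vector keeps the same dimension.
partial = {k: v for k, v in sample.items() if k != "image"}
pooled_partial, weights_partial = attention_pool(partial, query)
```

The returned weights can be inspected per modality, which is the kind of attention visualization the Findings section refers to.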
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies · Multimodal Machine Learning Applications
