Learning Multimodal Word Representation via Dynamic Fusion Methods
Shaonan Wang, Jiajun Zhang, Chengqing Zong

TL;DR
This paper introduces three dynamic fusion methods for multimodal word representations, allowing the model to adaptively weight different modalities based on word types, leading to improved semantic understanding.
Contribution
The paper proposes three dynamic fusion methods that assign an importance weight to each modality, with the weights learned under weak supervision from word association pairs.
Findings
Proposed methods outperform unimodal baselines.
Proposed methods outperform state-of-the-art multimodal models.
Dynamic weighting improves semantic representation quality.
Abstract
Multimodal models have been shown to outperform text-based models at learning semantic word representations. Almost all previous multimodal models treat the representations from different modalities equally. However, information from different modalities contributes differently to the meaning of words. This motivates us to build a multimodal model that dynamically fuses the semantic representations from different modalities according to the type of word. To that end, we propose three novel dynamic fusion methods that assign an importance weight to each modality, with the weights learned under the weak supervision of word association pairs. Extensive experiments demonstrate that the proposed methods outperform strong unimodal baselines and state-of-the-art multimodal models.
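To make the idea of per-modality importance weights concrete, here is a minimal sketch of one way such a fusion could look. It is an illustration only, not the authors' implementation: the function name `fuse`, the scalar sigmoid gates, the vector dimensions, and the randomly initialised parameters are all assumptions. In the paper the weights are learned from word association pairs; here they are left untrained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(text_vec, image_vec, params):
    """Weighted concatenation with one scalar gate per modality.

    Each gate is computed from the word's own vector, so different
    words can receive different modality weights (hypothetical design).
    """
    g_text = sigmoid(params["w_text"] @ text_vec + params["b_text"])
    g_image = sigmoid(params["w_image"] @ image_vec + params["b_image"])
    fused = np.concatenate([g_text * text_vec, g_image * image_vec])
    return fused, (g_text, g_image)

# Assumed dimensions and untrained random parameters, for illustration only.
rng = np.random.default_rng(0)
text_dim, image_dim = 300, 128
params = {
    "w_text": rng.normal(scale=0.01, size=text_dim),
    "b_text": 0.0,
    "w_image": rng.normal(scale=0.01, size=image_dim),
    "b_image": 0.0,
}

text_vec = rng.normal(size=text_dim)    # stand-in for a linguistic embedding
image_vec = rng.normal(size=image_dim)  # stand-in for a visual embedding

fused, (g_text, g_image) = fuse(text_vec, image_vec, params)
print(fused.shape, float(g_text), float(g_image))
```

A model that treats modalities equally corresponds to fixing both gates to the same constant; the dynamic variant lets the gates vary per word.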
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining
