TL;DR
This paper introduces MG-VMoE, a novel multimodal graph-based variational mixture of experts network, for zero-shot multimodal information extraction that effectively captures fine-grained semantic correlations between text and images.
Contribution
It proposes a new MG-VMoE model that aligns multimodal representations using a graph-based variational mixture of experts and incorporates virtual adversarial training for improved zero-shot extraction.
Findings
Outperforms baseline models on benchmark datasets
Effectively captures fine-grained semantic correlations
Demonstrates superior zero-shot multimodal extraction performance
Abstract
Multimodal information extraction on social media comprises a series of fundamental tasks for constructing multimodal knowledge graphs. These tasks aim to extract structured information from free text together with the accompanying images, and include multimodal named entity typing and multimodal relation extraction. However, the growing volume of multimodal data implies a growing category set, and newly emerged entity types or relations should be recognized without additional training. To address these challenges, we focus on zero-shot multimodal information extraction tasks, which require using the textual and visual modalities to recognize unseen categories. Compared with text-based zero-shot information extraction models, existing multimodal ones align the textual and visual modalities directly and exploit various fusion strategies to improve performance. But the…
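To make the zero-shot setting concrete, here is a minimal, generic sketch of how an unseen category can be recognized without additional training: a fused text-image vector is scored against an embedding of each candidate label, and the closest label wins. This is not the MG-VMoE architecture. The averaging `fuse` function, the toy three-dimensional vectors, and the names `zero_shot_predict` and `unseen_labels` are all assumptions made for illustration; a real system would obtain these vectors from trained text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fuse(text_vec, image_vec):
    """Placeholder fusion (assumption): element-wise average of both modalities."""
    return [(t + i) / 2.0 for t, i in zip(text_vec, image_vec)]


def zero_shot_predict(text_vec, image_vec, label_vecs):
    """Return the label whose embedding is most similar to the fused sample.

    No label-specific parameters are trained, so labels that never appeared
    during training can be added simply by supplying their embeddings.
    """
    sample = fuse(text_vec, image_vec)
    scores = {label: cosine(sample, vec) for label, vec in label_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-written embeddings for categories assumed unseen during training.
unseen_labels = {
    "person": [1.0, 0.0, 0.0],
    "location": [0.0, 1.0, 0.0],
    "organization": [0.0, 0.0, 1.0],
}

text_vec = [0.2, 0.9, 0.1]   # hypothetical encoding of the post text
image_vec = [0.0, 0.7, 0.3]  # hypothetical encoding of the attached image

predicted, scores = zero_shot_predict(text_vec, image_vec, unseen_labels)
print(predicted)
```

Because the label set enters only through `label_vecs`, extending it to newly emerged entity types or relations requires no retraining in this sketch; the quality of the prediction then depends entirely on how well the two modalities and the label descriptions are aligned in the shared space.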
