MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling
Philipp Seeberger, Dominik Wagner, Korbinian Riedhammer

TL;DR
This paper introduces MMUTF, a unified template filling model that connects the textual and visual modalities via textual prompts for multimedia event argument extraction. On the M2E2 benchmark it surpasses the current SOTA on textual EAE by +7% F1.
Contribution
The paper proposes a unified template filling approach that connects the textual and visual modalities through textual prompts, enabling cross-ontology transfer and the incorporation of event-specific semantics for multimedia event argument extraction.
Findings
Surpasses the current SOTA on textual EAE by +7% F1.
Outperforms second-best systems in multimedia EAE.
Demonstrates effectiveness on the M2E2 benchmark.
Abstract
With the advancement of multimedia technologies, news documents and user-generated content are often represented as multiple modalities, making Multimedia Event Extraction (MEE) an increasingly important challenge. However, recent MEE methods employ weak alignment strategies and data augmentation with simple classification models, which ignore the capabilities of natural language-formulated event templates for the challenging Event Argument Extraction (EAE) task. In this work, we focus on EAE and address this issue by introducing a unified template filling model that connects the textual and visual modalities via textual prompts. This approach enables the exploitation of cross-ontology transfer and the incorporation of event-specific semantics. Experiments on the M2E2 benchmark demonstrate the effectiveness of our approach. Our system surpasses the current SOTA on textual EAE by +7% F1,…
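The idea of filling a natural-language event template with candidates from either modality can be illustrated with a minimal sketch. This is a toy illustration under stated assumptions, not the authors' model: the template string, the role names, the `Candidate` class, the `fill_template` function, the hand-written vectors, and the cosine-similarity threshold are all hypothetical. A real system would presumably obtain role and candidate representations from trained text and image encoders.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    name: str               # text entity span or a label for an image region
    modality: str           # "text" or "image"
    embedding: List[float]  # toy vector standing in for an encoder output


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fill_template(
    template: str,
    role_queries: Dict[str, List[float]],
    candidates: List[Candidate],
    threshold: float = 0.5,
) -> Tuple[Dict[str, Candidate], str]:
    """For each role slot, pick the best-scoring candidate above the threshold."""
    filled: Dict[str, Candidate] = {}
    for role, query in role_queries.items():
        best = max(candidates, key=lambda c: cosine(query, c.embedding), default=None)
        if best is not None and cosine(query, best.embedding) >= threshold:
            filled[role] = best
    text = template
    for role in role_queries:
        value = filled[role].name if role in filled else "[unfilled]"
        text = text.replace(f"<{role}>", value)
    return filled, text


# Hypothetical template and role queries (assumed, not from the paper).
template = "<attacker> attacked <target> using <instrument>"
role_queries = {
    "attacker": [1.0, 0.0, 0.0],
    "target": [0.0, 1.0, 0.0],
    "instrument": [0.0, 0.0, 1.0],
}
candidates = [
    Candidate("soldiers", "text", [0.9, 0.1, 0.0]),
    Candidate("convoy", "text", [0.2, 0.9, 0.1]),
    Candidate("image region 0", "image", [0.1, 0.2, 0.95]),
]

filled, text = fill_template(template, role_queries, candidates)
print(text)
```

Because text entities and image regions are scored against the same role slots of one textual template, a single matching routine covers both modalities in this sketch; slots with no candidate above the threshold stay unfilled.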
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
