RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

Yongkang Jin; Jianwen Luo; Jingjing Wang; Jianmin Yao; Yu Hong

arXiv:2602.13748 · cs.CL · February 17, 2026

Open Access

TL;DR

This paper introduces RMPL, a relation-aware multi-task progressive learning framework that enhances multimedia event extraction by leveraging heterogeneous supervision and stage-wise training, and that is especially effective under low-resource conditions.

Contribution

It presents a novel multi-task progressive learning approach that explicitly models relations and utilizes heterogeneous supervision for improved multimedia event extraction.

Findings

01. RMPL outperforms existing methods on the M2E2 benchmark.

02. Stage-wise training improves event and argument extraction accuracy.

03. Incorporating heterogeneous supervision enhances multimodal grounding.

Abstract

Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision-Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques