Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities

Hammad A. Ayyubi; Christopher Thomas; Lovish Chum; Rahul Lokesh; Long Chen; Yulei Niu; Xudong Lin; Xuande Feng; Jaywon Koo; Sounak Ray; Shih-Fu Chang

arXiv:2206.07207 · cs.CV · December 21, 2023 · 1 citation

TL;DR

This paper introduces the novel task of extracting hierarchical event structures across modalities (video and text), supports it with a new dataset and a weakly supervised model, and aims to advance the understanding of how events in multimedia content relate to one another.

Contribution

It proposes the task of multimodal event hierarchy extraction, introduces the MultiHiEve dataset with rich event hierarchies, and develops a weakly supervised model that improves performance on this task.

Findings

01. State-of-the-art models underperform on the new task.

02. The proposed weakly supervised model outperforms the baselines.

03. The MultiHiEve dataset enables research on hierarchical multimodal events.

Abstract

Events describe important happenings in our world. Naturally, understanding the events mentioned in multimedia content and how they are related is an important way of comprehending our world. Existing literature can infer whether events across the textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that arise because the same events are referred to at many semantic levels. For example, in Figure 1, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and an airplane being "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in…
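To make the Figure 1 example more concrete, here is a minimal Python sketch of one way a multimodal event hierarchy could be represented. The `Event` and `EventHierarchy` classes, the `modality` field, and the decision to treat "war" as a text mention are all hypothetical illustrations; this is not the MultiHiEve annotation format or the authors' model, only a toy data structure for the parent/subevent relation the abstract describes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """An event mention in one modality (hypothetical schema, not MultiHiEve's)."""
    label: str
    modality: str  # assumed values: "text" or "video"


@dataclass
class EventHierarchy:
    """Maps each parent event to the subevents that realize it at a lower semantic level."""
    children: dict[Event, list[Event]] = field(default_factory=dict)

    def add_subevent(self, parent: Event, child: Event) -> None:
        # Record that `child` is a lower-level manifestation of `parent`.
        self.children.setdefault(parent, []).append(child)

    def subevents(self, parent: Event) -> list[Event]:
        # Return a copy so callers cannot mutate the stored hierarchy.
        return list(self.children.get(parent, []))

    def is_multimodal(self, parent: Event) -> bool:
        # True when the parent's subevents come from more than one modality.
        return len({c.modality for c in self.children.get(parent, [])}) > 1

    def cross_modal_pairs(self) -> list[tuple[Event, Event]]:
        # Parent/subevent links whose two endpoints lie in different modalities.
        return [
            (parent, child)
            for parent, kids in self.children.items()
            for child in kids
            if child.modality != parent.modality
        ]


# Toy version of the Figure 1 example (modality of "war" is an assumption).
war = Event("war", "text")
tanks = Event("tanks firing", "video")
shot = Event("shot", "text")

hierarchy = EventHierarchy()
hierarchy.add_subevent(war, tanks)
hierarchy.add_subevent(war, shot)
print(hierarchy.is_multimodal(war))       # True
print(hierarchy.cross_modal_pairs())      # only the war -> tanks firing link
```

Keeping `Event` frozen makes it hashable, so events can serve directly as dictionary keys; a grounding-only view would just ask whether two mentions are identical, whereas this structure also records which mentions sit below which.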

Videos

Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities (Underline)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Balanced Selection