Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension
Huibin Zhang, Zhengkun Zhang, Yao Zhang, Jun Wang, Yufan Li, Ning Jiang, Xin Wei, Zhenglu Yang

TL;DR
This paper introduces a novel Temporal-Modal Entity Graph (TMEG) for fine-grained understanding of procedural multimodal documents, capturing entity relations across time and modalities to improve reasoning tasks.
Contribution
It proposes a new graph-based model that encodes entity relations in both temporal and cross-modal contexts for procedural multimodal comprehension.
Findings
TMEG outperforms baseline models on RecipeQA and CraftQA datasets.
The approach improves entity relation modeling in procedural multimodal documents.
Experimental results demonstrate enhanced reasoning capabilities.
Abstract
Procedural Multimodal Documents (PMDs) organize textual instructions and corresponding images step by step. Comprehending PMDs and inducing their representations for the downstream reasoning tasks is designated as Procedural MultiModal Machine Comprehension (M3C). In this study, we approach Procedural M3C at a fine-grained level (compared with existing explorations at a document or sentence level), that is, entity. With delicate consideration, we model entity both in its temporal and cross-modal relation and propose a novel Temporal-Modal Entity Graph (TMEG). Specifically, graph structure is formulated to capture textual and visual entities and trace their temporal-modal evolution. In addition, a graph aggregation module is introduced to conduct graph encoding and reasoning. Comprehensive experiments across three Procedural M3C tasks are conducted on a traditional dataset RecipeQA and…
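To make the abstract's idea concrete, the following is a minimal sketch of an entity graph with temporal and cross-modal edges, followed by one round of neighbour aggregation. It is not the authors' implementation: the node key `(step, modality, name)`, the edge rules (same entity name in consecutive steps, same entity name in both modalities of one step), the function names `build_graph` and `aggregate`, and the mean aggregation are all assumptions chosen for illustration, and the toy recipe data is invented.

```python
from collections import defaultdict

# Hypothetical node key: (step index, modality, entity name).
# Assumes entity names are unique within one step and modality.

def build_graph(steps):
    """steps: one dict per step, with 'text' and 'image' lists of entity names."""
    nodes = [
        (t, modality, name)
        for t, step in enumerate(steps)
        for modality in ("text", "image")
        for name in step[modality]
    ]
    node_set = set(nodes)
    edges = defaultdict(set)
    for t, modality, name in nodes:
        node = (t, modality, name)
        # Temporal edge (assumed rule): same entity and modality in the next step.
        nxt = (t + 1, modality, name)
        if nxt in node_set:
            edges[node].add(nxt)
            edges[nxt].add(node)
        # Modal edge (assumed rule): same entity, same step, other modality.
        other = (t, "image" if modality == "text" else "text", name)
        if other in node_set:
            edges[node].add(other)
    return nodes, edges


def aggregate(nodes, edges, features):
    """One round of mean aggregation over each node and its neighbours."""
    out = {}
    for node in nodes:
        group = [node, *sorted(edges[node])]
        dim = len(features[node])
        out[node] = [
            sum(features[n][i] for n in group) / len(group) for i in range(dim)
        ]
    return out


# Toy two-step document (invented data).
steps = [
    {"text": ["dough", "flour"], "image": ["dough"]},
    {"text": ["dough"], "image": ["dough", "oven"]},
]
nodes, edges = build_graph(steps)
features = {node: [float(i)] for i, node in enumerate(nodes)}
updated = aggregate(nodes, edges, features)
```

In this sketch an entity that appears only once, such as "flour", has no neighbours and keeps its own feature, while "dough" mixes information from the next step and from the other modality. A learned aggregation module would replace the plain mean.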
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
