Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals

Te-Lin Wu; Alex Spangher; Pegah Alipoormolabashi; Marjorie Freedman; Ralph Weischedel; Nanyun Peng

arXiv:2110.08486·cs.CL·February 22, 2024

Open Access

TL;DR

This paper benchmarks machine learning models' ability to sequence unordered multimodal instructions and introduces sequentiality-aware pretraining techniques that improve performance by over 5%.

Contribution

It provides a new dataset and human annotations for multimodal instruction sequencing and proposes pretraining methods that enhance model reasoning over multimodal procedures.

Findings

01. Models perform worse than humans in multimodal sequencing.

02. Current models struggle to utilize multimodal information effectively.

03. Sequentiality-aware pretraining improves performance by over 5%.

Abstract

The ability to sequence unordered events is an essential skill to comprehend and reason about real world task procedures, which often requires thorough understanding of temporal common sense and multimodal information, as these procedures are often communicated through a combination of texts and images. Such capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans are capable of reasoning about and sequencing unordered multimodal procedural instructions, whether current machine learning models have such essential capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find models not only perform…
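To make the sequencing task concrete, here is a minimal sketch of how a predicted step order could be scored against the reference order. The step identifiers and both scoring functions (`exact_match`, `pairwise_order_accuracy`) are illustrative assumptions, not the evaluation protocol or metrics reported in the paper.

```python
from itertools import combinations


def exact_match(predicted, gold):
    """Return True only if the predicted order equals the gold order exactly."""
    return list(predicted) == list(gold)


def pairwise_order_accuracy(predicted, gold):
    """Fraction of step pairs whose relative order agrees with the gold order.

    Both arguments are sequences containing the same unique step identifiers.
    """
    position = {step: i for i, step in enumerate(predicted)}
    pairs = list(combinations(gold, 2))  # (earlier, later) pairs in gold order
    if not pairs:
        return 1.0  # zero or one step: trivially ordered
    correct = sum(1 for a, b in pairs if position[a] < position[b])
    return correct / len(pairs)


# Hypothetical example: four steps of a manual, each a (text, image) pair
# referred to here only by a placeholder identifier.
gold_order = ["s1", "s2", "s3", "s4"]
predicted_order = ["s1", "s3", "s2", "s4"]  # a model swaps two middle steps

print(exact_match(predicted_order, gold_order))
print(pairwise_order_accuracy(predicted_order, gold_order))
```

The pairwise score gives partial credit for nearly correct orderings, whereas exact match counts a single swapped pair as a complete failure; which of the two is more appropriate depends on the downstream application.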

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications