Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, Nanyun Peng

TL;DR
This paper benchmarks machine learning models' ability to sequence unordered multimodal instructions and introduces sequentiality-aware pretraining techniques that improve performance by over 5%.
Contribution
It provides a new dataset and human annotations for multimodal instruction sequencing and proposes pretraining methods that enhance model reasoning over multimodal procedures.
Findings
Models perform worse than humans in multimodal sequencing.
Current models struggle to utilize multimodal information effectively.
Sequentiality-aware pretraining improves performance by over 5%.
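To make "performance" on a sequencing task concrete, here is a minimal sketch of how a predicted step order might be scored against the reference order. The two metrics (exact match and pairwise ordering accuracy) and the function names are illustrative assumptions; this summary does not state which metrics the paper reports.

```python
from itertools import combinations


def exact_match(pred, gold):
    """Return 1.0 if the predicted order equals the gold order, else 0.0."""
    return float(list(pred) == list(gold))


def pairwise_accuracy(pred, gold):
    """Fraction of step pairs whose relative order in pred agrees with gold.

    Assumes pred is a permutation of gold with no duplicate steps.
    """
    position = {step: i for i, step in enumerate(pred)}
    pairs = list(combinations(gold, 2))  # (a, b) with a before b in gold
    if not pairs:
        return 1.0  # zero or one step: trivially ordered
    correct = sum(1 for a, b in pairs if position[a] < position[b])
    return correct / len(pairs)


# Hypothetical step identifiers; each could stand for a text+image instruction.
gold = ["s1", "s2", "s3", "s4"]
pred = ["s1", "s3", "s2", "s4"]

em = exact_match(pred, gold)
pa = pairwise_accuracy(pred, gold)
```

Exact match only rewards a fully correct sequence, while the pairwise score gives partial credit when most steps are in the right relative order, which is why the two are often reported together for ordering tasks.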
Abstract
The ability to sequence unordered events is an essential skill to comprehend and reason about real world task procedures, which often requires thorough understanding of temporal common sense and multimodal information, as these procedures are often communicated through a combination of texts and images. Such capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans are capable of reasoning about and sequencing unordered multimodal procedural instructions, whether current machine learning models have such essential capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find models not only perform…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
