ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
Kimihiro Hasegawa, Wiradee Imrattanatrai, Zhi-Qi Cheng, Masaki Asada, Susan Holm, Yuran Wang, Ken Fukuda, Teruko Mitamura

TL;DR
ProMQA is a multimodal evaluation dataset of 401 question-answer pairs grounded in user recordings of cooking activities and their corresponding recipes, designed to measure how well systems understand multimodal instructions and actions.
Contribution
The paper introduces ProMQA, a novel dataset for multimodal procedural question answering. It was built with a cost-effective human-LLM collaborative annotation process, in which LLM-generated QA pairs augment existing annotations and are then verified by humans.
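The generate-then-verify workflow described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' pipeline: the `QACandidate` record, its field names, the `verify_candidates` helper, and the toy data are all hypothetical, and the human judgment is stood in for by a simple callable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class QACandidate:
    """Hypothetical record for one LLM-proposed QA pair (field names are assumptions)."""
    recording_id: str  # identifier of the user recording the question refers to
    question: str
    answer: str


def verify_candidates(
    candidates: Iterable[QACandidate],
    human_accepts: Callable[[QACandidate], bool],
) -> List[QACandidate]:
    """Keep only the candidates that the human verifier accepts."""
    return [c for c in candidates if human_accepts(c)]


# Toy stand-in data, not taken from ProMQA.
candidates = [
    QACandidate("rec-001", "What should I do next?", "Add the chopped onion to the pan."),
    QACandidate("rec-001", "", "Stir."),  # malformed candidate: empty question
]

# Stand-in for a human decision: here, reject candidates with an empty question.
accepted = verify_candidates(candidates, lambda c: bool(c.question.strip()))
print(len(accepted))
```

Passing the verifier as a callable keeps the filtering step independent of how judgments are collected. In a real annotation setup, that callable would wrap a review interface instead of a rule.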
Findings
Current systems lag behind human performance on ProMQA.
Benchmark results highlight significant gaps in multimodal understanding.
ProMQA reveals new challenges in application-oriented multimodal understanding.
Abstract
Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recording of procedural activities, i.e., cooking, coupled with their corresponding instructions/recipes. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide the benchmark results to set the baseline performance on ProMQA. Our experiment reveals a…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Occupational Health and Safety Research
Methods: Sparse Evolutionary Training
