ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding

Kimihiro Hasegawa; Wiradee Imrattanatrai; Zhi-Qi Cheng; Masaki Asada; Susan Holm; Yuran Wang; Ken Fukuda; Teruko Mitamura

arXiv:2410.22211·cs.CL·November 5, 2025

TL;DR

ProMQA is a multimodal evaluation dataset of 401 question-answer pairs grounded in user recordings of cooking activities and their corresponding recipes, designed to measure how well systems understand multimodal instructions and actions.

Contribution

The paper introduces ProMQA, a novel dataset for multimodal procedural question answering, created through a cost-effective human-LLM collaborative annotation process.
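The generate-then-verify idea behind the annotation process can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `QAPair`, `collaborative_annotation`, `toy_generate`, and `toy_verify` are hypothetical names, the generator stands in for an LLM call, and the verifier stands in for a human reviewer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class QAPair:
    """A candidate question-answer pair (hypothetical structure)."""
    question: str
    answer: str


def collaborative_annotation(
    existing_annotations: Iterable[str],
    generate: Callable[[str], List[QAPair]],
    verify: Callable[[QAPair], bool],
) -> List[QAPair]:
    """Generate candidate QA pairs per annotation and keep the verified ones."""
    kept: List[QAPair] = []
    for annotation in existing_annotations:
        for candidate in generate(annotation):  # LLM step (stubbed here)
            if verify(candidate):               # human check (stubbed here)
                kept.append(candidate)
    return kept


def toy_generate(annotation: str) -> List[QAPair]:
    """Stand-in for an LLM: returns one usable and one malformed candidate."""
    return [
        QAPair(f"What should I do after '{annotation}'?", "Follow the next recipe step."),
        QAPair("", "A candidate with an empty question."),
    ]


def toy_verify(pair: QAPair) -> bool:
    """Stand-in for a human reviewer: rejects empty questions or answers."""
    return bool(pair.question.strip()) and bool(pair.answer.strip())


verified = collaborative_annotation(
    ["crack the egg", "whisk the batter"], toy_generate, toy_verify
)
```

Passing the generator and verifier as callables keeps the loop independent of any particular model or review interface, which is the only design point this sketch is meant to convey.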

Findings

1. Current systems lag behind human performance on ProMQA.

2. Benchmark results highlight significant gaps in multimodal understanding.

3. ProMQA reveals new challenges in application-oriented multimodal understanding.

Abstract

Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recordings of procedural activities, i.e., cooking, coupled with their corresponding instructions/recipes. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide the benchmark results to set the baseline performance on ProMQA. Our experiment reveals a…

Code & Models

Repositories

kimihiroh/promqa
PyTorch · Official

Datasets

kimihiroh/promqa-cooking-frames
dataset · 65 downloads

Videos

ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Occupational Health and Safety Research

Methods: Sparse Evolutionary Training