MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain   Everyday Tasks

Jingyuan Qi; Minqian Liu; Ying Shen; Zhiyang Xu; Lifu Huang

arXiv:2310.04965·cs.CL·January 22, 2024

MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks

Jingyuan Qi, Minqian Liu, Ying Shen, Zhiyang Xu, Lifu Huang

PDF

Open Access 1 Repo 1 Video

TL;DR

This paper introduces MultiScript, a new benchmark for multimodal script learning from videos and text, enabling AI systems to generate and predict steps for everyday tasks across diverse domains.

Contribution

It presents a novel benchmark with two tasks for multimodal script learning and proposes knowledge-guided generative frameworks leveraging large language models.

Findings

01

Proposed frameworks outperform baseline models.

02

MultiScript covers over 6,655 tasks across 19 domains.

03

Significant improvement in script generation and step prediction accuracy.

Abstract

Automatically generating scripts (i.e. sequences of key steps described in text) from video demonstrations and reasoning about the subsequent steps are crucial to the modern AI virtual assistants to guide humans to complete everyday tasks, especially unfamiliar ones. However, current methods for generative script learning rely heavily on well-structured preceding steps described in text and/or images or are limited to a certain domain, resulting in a disparity with real-world user scenarios. To address these limitations, we present a new benchmark challenge -- MultiScript, with two new tasks on task-oriented multimodal script learning: (1) multimodal script generation, and (2) subsequent step prediction. For both tasks, the input consists of a target task name and a video illustrating what has been done to complete the target task, and the expected output is (1) a sequence of structured…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

vt-nlp/multiscript
noneOfficial

Videos

MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks· underline

Taxonomy

TopicsMultimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling