Multimedia Generative Script Learning for Task Planning

Qingyun Wang; Manling Li; Hou Pong Chan; Lifu Huang; Julia Hockenmaier; Girish Chowdhary; Heng Ji

arXiv:2208.12306 · cs.CL · June 11, 2025 · 1 citation

TL;DR

This paper introduces Multimedia Generative Script Learning, a task that generates subsequent task steps by tracking both visual and textual states. It also presents a benchmark for the task and a model that addresses its multimedia, induction, and diversity challenges.

Contribution

The paper presents the first benchmark for multimedia script learning and proposes a model that combines visual encoding, retrieval-augmented decoding, and contrastive learning to generate diverse steps and handle unseen tasks.

Findings

01. The proposed model outperforms baselines in generating accurate and diverse task steps.

02. The benchmark provides a new standard for evaluating multimedia script learning.

03. Visual state encoding improves the understanding of task progress.

Abstract

Goal-oriented generative script learning aims to generate subsequent steps to reach a particular goal, which is an essential task to assist robots or humans in performing stereotypical activities. An important aspect of this process is the ability to capture historical states visually, which provides detailed information that is not covered by text and will guide subsequent steps. Therefore, we propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities, and we present the first benchmark, containing 5,652 tasks and 79,089 multimedia steps. This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps. We…
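
To make the task setup concrete, the sketch below shows one way a single instance could be represented: a goal, a history of completed steps (each with text and, optionally, an image of the resulting state), and the next step to be generated. This is only an illustration under stated assumptions. The names `Step`, `ScriptExample`, `build_text_input`, the `image_path` field, and the `[IMG]` marker are all hypothetical and are not taken from the paper or its repository; the benchmark's actual schema and the model's input format may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One completed step: its text plus an optional image of the resulting state."""
    text: str
    image_path: Optional[str] = None  # hypothetical field, not the benchmark's schema


@dataclass
class ScriptExample:
    """One task instance: a goal, the multimedia step history, and the step to generate."""
    goal: str
    history: List[Step]
    next_step: str


def build_text_input(example: ScriptExample, image_token: str = "[IMG]") -> str:
    """Flatten goal and history into one string, marking steps that come with an image."""
    parts = [f"Goal: {example.goal}"]
    for i, step in enumerate(example.history, start=1):
        marker = f" {image_token}" if step.image_path else ""
        parts.append(f"Step {i}: {step.text}{marker}")
    return " | ".join(parts)


# Illustrative data only; not drawn from the benchmark.
example = ScriptExample(
    goal="Make a paper boat",
    history=[
        Step("Fold the sheet in half.", "step1.jpg"),
        Step("Fold the top corners to the center."),
    ],
    next_step="Fold the bottom edges upward.",
)
print(build_text_input(example))
```

In this sketch the image is reduced to a placeholder token in the text. A real system would encode the image itself so that visual state information not covered by the step text can inform the generated step.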

Code & Models

Repositories

EagleW/Multimedia-Generative-Script-Learning-for-Task-Planning
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Subtitles and Audiovisual Media

Methods: Contrastive Learning