Show and Guide: Instructional-Plan Grounded Vision and Language Model

Diogo Glória-Silva; David Semedo; João Magalhães

arXiv:2409.19074·cs.CV·October 22, 2024

TL;DR

This paper introduces MM-PlanLLM, a multimodal language model that integrates visual and textual information to improve guidance in complex instructional tasks, enabling retrieval of relevant video segments and generation of next steps based on visual progress.

Contribution

The work presents the first multimodal LLM for instructional plans, employing a novel multitask-multistage training approach to align visual and textual plan understanding.

Findings

01. Strong performance on multimodal and textual dialogue tasks
02. Effective cross-modal temporal and plan-structure representations
03. Successful retrieval and generation in instructional scenarios

Abstract

Guiding users through complex procedural plans is an inherently multimodal task in which visually illustrated plan steps are crucial to delivering effective guidance. However, existing work on plan-following language models (LMs) often cannot handle multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the…
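To make the input and output of the two tasks named in the abstract concrete, here is a minimal toy sketch. It is not MM-PlanLLM's method: the `StepSegment` type, the `retrieve_moment` and `next_step_from_image` helpers, and the use of cosine similarity over placeholder embedding vectors are all assumptions made purely for illustration. The actual model is a trained multimodal LLM.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepSegment:
    """Hypothetical record pairing one plan step with its video span."""
    step_index: int         # 0-based position of the step in the plan
    text: str               # textual description of the step
    start_s: float          # segment start time in seconds
    end_s: float            # segment end time in seconds
    embedding: List[float]  # placeholder vector (assumed, not a real encoder output)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_moment(query_embedding: List[float],
                    segments: List[StepSegment]) -> StepSegment:
    """Toy stand-in for Conversational Video Moment Retrieval:
    return the step-video segment most similar to the user query."""
    return max(segments, key=lambda s: cosine(query_embedding, s.embedding))


def next_step_from_image(image_embedding: List[float],
                         segments: List[StepSegment]) -> Optional[StepSegment]:
    """Toy stand-in for Visually-Informed Step Generation:
    locate the step the image most resembles, then return the step after it
    (or None if the image matches the final step)."""
    current = retrieve_moment(image_embedding, segments)
    by_index = {s.step_index: s for s in segments}
    return by_index.get(current.step_index + 1)
```

In this sketch, a user query embedding maps to a `(start_s, end_s)` span of the plan video, while an image of the user's progress maps to the text of the following step. Both functions share the same similarity lookup only to keep the example short.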

Code & Models

Repositories

dmgcsilva/mmplanllm (PyTorch, official)

Videos

Show and Guide: Instructional-Plan Grounded Vision and Language Model · Underline

Taxonomy

Topics: Education and Critical Thinking Development