Show and Guide: Instructional-Plan Grounded Vision and Language Model
Diogo Glória-Silva, David Semedo, João Magalhães

TL;DR
This paper introduces MM-PlanLLM, a multimodal language model that integrates visual and textual information to improve guidance in complex instructional tasks, enabling retrieval of relevant video segments and generation of next steps based on visual progress.
Contribution
The work presents the first multimodal LLM for instructional plans, employing a novel multitask-multistage training approach to align visual and textual plan understanding.
Findings
Strong performance on multimodal and textual dialogue tasks
Effective cross-modal temporal and plan-structure representations
Successful retrieval and generation in instructional scenarios
Abstract
Guiding users through complex procedural plans is an inherently multimodal task in which having visually illustrated plan steps is crucial to deliver an effective plan guidance. However, existing works on plan-following language models (LMs) often are not capable of multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the…
Taxonomy
Topics: Education and Critical Thinking Development
