VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Diogo Glória-Silva; David Semedo; João Magalhães

arXiv:2602.19146·cs.CV·April 1, 2026

TL;DR

VIGiA is a multimodal dialogue model that understands and reasons over instructional videos by integrating visual inputs, plans, and user interactions, advancing conversational guidance in complex tasks.

Contribution

It introduces a multimodal plan reasoning and retrieval framework for dialogue models, enabling more accurate, grounded, and plan-aware interactions over instructional videos.

Findings

01. VIGiA outperforms existing models on all tasks in the dataset.

02. Achieves over 90% accuracy on plan-aware visual question answering.

03. Demonstrates effective reasoning over multimodal instructional plans.

Abstract

We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art…
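To make the idea of plan-based retrieval concrete, here is a minimal sketch that picks the plan step most similar to a query. It is an illustration only, not VIGiA's actual method: the cosine-similarity ranking, the function names `cosine` and `retrieve_step`, the example plan steps, and the toy 3-dimensional embeddings are all assumptions made for this example. A real system would obtain the embeddings from a text or vision encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_step(query_vec, step_vecs):
    """Return the index of the plan step whose embedding is most similar to the query."""
    scores = [cosine(query_vec, v) for v in step_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical plan and made-up embeddings, for illustration only.
plan_steps = ["Chop the onions", "Heat oil in a pan", "Add onions and stir"]
step_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.0, 0.8]]

# A query embedding that lies closest to the second step.
query_vec = [0.1, 0.9, 0.1]
best = retrieve_step(query_vec, step_vecs)
print(plan_steps[best])
```

Because the query and the steps only need to share an embedding space, the same ranking could in principle be applied to textual or visual step representations, which is the property the abstract describes.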
